PREFACE


Accounts of the use and evaluation of psychological tests are
contained in a number of reports previously issued by the Industrial
Health Research Board, the earliest being a review of the literature
on vocational guidance by Muscio (No. 12), which may be regarded
as the ancestor of the present publication. In these reports, the
tests discussed have been for abilities, for skills, or for special
personal qualities, e.g. those involved in "accident proneness";
as instances, may be mentioned Nos. 31 and 53 on the use of
performance tests of intelligence in vocational guidance, Nos. 59
and 38 on tests for accident proneness, and No. 64 dealing with
vocational tests of dexterity.

The present report surveys critically tests of a different kind,
namely attitude tests, rating scales, and personality questionnaires.
The value of these tests is admittedly less well-established than
that of tests of abilities; yet the qualities they try to test are so
important as to be worthy of every effort to put them on a
scientific basis.
That the Board have realised the importance to industry of
...sing and endeavouring to estimate the influence on contentment
and efficiency of the mental attitudes and emotional traits of workers,
is shown in several of the reports already issued by them, e.g.
on fatigue and boredom in repetitive work (Nos. 56 and 77), and
the nervous temperament (No. 61). The present report describes
the attempts now being made to subject such personal characteristics
to accurate measurement.
The field covered is wide; much of the work described, for
instance, has been done in America. In this country, also, a wide
interest is taken in these problems, not only from their social and
educational aspects, but owing to their importance in industry,
where more and more attention is being paid to discovering and
...ing the temperamental qualities and abilities that influence an
individual's adjustment to his work, and to exploring the attitudes of
employees to their working conditions, and to their employers.
The Board believe that this report, whilst of special interest to
investigators doing field work in psychology and to students of
industrial and social psychological problems, is nevertheless relevant
in substance to the interests and problems which the Board share
with industrialists. There exists, so far as they know, no collected
literature which covers the ground to the same extent both
bibliographically and critically.

May, 1938
INTRODUCTION


The widespread realization of the importance of "the human
factor" is a striking feature of present-day civilization. More and
more attention is being paid to the instincts, sentiments, complexes 
and other psychological characteristics of human beings. In industry 
we attempt to discover the main temperamental qualities and 
abilities that influence an individual’s adjustment to his job, and we 
explore the attitudes of employees to working conditions or to their 
employers. In education we try to guide parents and teachers as 
to the best means of dealing with children at home and at school, 
and treat the maladjusted and the delinquent at Psychological 
Clinics. Social psychologists even hope some day to be able to 
alleviate the world’s political and economic ills by means of their 
studies of human nature. 

Up till fairly recently the whole of our knowledge of people and 
their motives has been gathered by haphazard and unscientific
methods, such as everyday observation, subjective inference and 
intuition. We are only beginning to formulate the general principles 
which should govern the collection of data and the deduction of 
conclusions, and are still more backward in the establishment of 
systematic laws which validly describe human thoughts and conduct. 
The number of impartial investigators in the social sciences is 
indeed rapidly increasing, yet their methods of attack on their
problems are still highly fallible, when contrasted with the methods 
used for research in the physical sciences. 

The claim that a scientific psychology or sociology can only be 
obtained by applying to human beings the experimental and logical 
techniques of the more advanced sciences would appear, to the 
present writer, fallacious. Nevertheless there is a tremendous 
need of controlled experimentation and exact measurement in our 
field. Such generalizations as we can make about human beings 
are still derived far too largely from uncontrolled observations and 
from interviews or discussions, or occasionally from the answers to 
widely distributed questionnaires. Such methods are probably 
essential in the present state of our knowledge for obtaining a general 
conspectus of any problem. Indeed it would be futile to attempt 
a psychological or sociological experiment in a factory, school, or 
home, unless we first possessed an intimate personal knowledge of 
the people and of the relevant conditions. Such personal knowledge
is of course unscientific, although its trustworthiness may be greatly
improved by adherence to such principles of interviewing and of
case-study writing as are now available. But having obtained this
general conspectus, we should certainly attempt to apply more
refined techniques. It is with the description of one particular
branch of these techniques that the present Report is concerned.

The specialized methods of the scientific psychologist are far too 
numerous to treat in a single Report, and we shall omit altogether 
what is perhaps the biggest class, namely the recording and measure- 
ment under controlled conditions of people’s actions, or of the 
products of their actions, and confine ourselves solely to verbal
behaviour—either oral or, more usually, written. This will be 
further restricted to types of verbal behaviour which are susceptible 
to quantitative treatment, and so will exclude the ordinary interview, 
the clinical techniques of the medical psychologist, together with 
diaries or autobiographies and the like. Since there are already 
many excellent accounts of verbal tests of intelligence and of special 
aptitudes, we will deal only with the measurement of emotional 
characteristics. A more exact definition of ...

... techniques, where people ...

... illustrated by the following instance*. A question
on pacifism was answered by 22,627 students in 70 American
... cent. checked the answer that they would not
participate in any war, 33 per cent. agreed that they would participate
if their country was invaded, and 28 per cent. ...
join in any war declared by the U.S. This
... a measure of a ... The question
might however be expanded, or so combined with other questions
... would yield ...

... the scaling of attitudes, etc. in
equivalent units (§§ 11-14, 45-53, 87,
187), and with multiple factor analysis (§§ 88, 60-70, 112-119, 144-

* For a general outline of questionnaire techniques, cf. Vernon (1938). 
152, 221-224, 285-287)†, the reason being that few general surveys of
these techniques have been published so far, and none are readily 
available in this country. In contrast there are already numerous 
accounts of most of the tests and their chief results, e.g. Symonds 
(1931, 1934), Fryer (1931), Droba (1932), Jones and Burks (1936), 
Allport (1937), Murphy (1937), etc. 

Although the Report concentrates on methods, yet it must not 
pretend to be a detailed practical handbook on how to apply these 
methods. Rather it is intended as a guide for showing what methods 
are available, where fuller descriptions of them may be obtained, 
and in particular what chief precautions are needed in using them. 

An attempt has been made to draw on the work of British
investigators wherever possible, and most of the illustrations in
Chapter II are taken from such work. In subsequent chapters
however the enormous majority of the studies cited are American. 
There cannot be the slightest doubt that the development of these 
methods is far more advanced in the United States than in any other 
country. References to the studies of German, French and other
foreign psychologists will be conspicuous by their absence. This is 
not because of any lack of valuable contributions to the psychology 
of personality and of society in these countries—we need only 
mention the names of Freud, Jung, Adler, Spranger, Ch. Bühler, 
Lahy, Mira, Luria, etc. to belie the view—but solely because their 
employment of the quantitative techniques with which we are 
concerned is negligible. As the writer has shown elsewhere (1933b), 
one of the most pressing needs of contemporary psychology is the
reconciliation and integration of the approaches characteristic of 
Continental and American investigators. The interpretative insight 
of the former should help to correct the blind empiricism of the 
latter; and the uncontrolled subjectivity of the one should be checked
by the scientific discipline of the other. In point of fact considerable 
advances have been made in this direction during the past five years, 
and there is reason to hope for still further interpenetration in the 
future. 

No one can hope to treat this field entirely without prejudice ; 
his views on the psychology of personality will inevitably affect 
his judgments, particularly when, as in this Report, he attempts 
to go beyond mere description of the work, and to provide a general 
assessment of its significance. It is anticipated, moreover, that while 
some will disagree with the writer's criticisms of ratings, personality 
questionnaires, purely empirical tests, etc., others will regard the 
methods as still more fallible, and therefore more useless in practice, 
than he has done. To the former he would point out how meagre
are the actual achievements of scientific methods in the study of
personality, when contrasted with the amazing extent and variety
of our non-scientific knowledge of, and ability to control, human


† These numbers refer to the sub-sections or paragraphs into which
Chapters I-VII have been divided. References to the bibliography are given 
by means of dates. 
beings in everyday life. The influence of experimental psychology
and mental testing in the realm of educational and industrial 
abilities is immense ; but not in the realm of emotions and motives. 
There the main advances have come so far from clinical and theoret- 
ical rather than from experimental psychology. The latter group 
of objectors should be reminded once more that these methods are 
still in their infancy, and cannot be expected to be able to provide 
solutions to any and every social-psychological problem. They do 
not supplant, but supplement and check "common-sense" methods
such as observation, interviews and ordinary questionnaires; and
when used with caution they can supply far more accurate and
objective information upon a great variety of special problems
than can the subjective and biased generalizations of which both
"common-sense" and clinical psychology so frequently consist.


II.—GROUP SURVEYS OF ATTITUDES AND INTERESTS
A. TECHNIQUES FOR COLLECTING GROUP OPINIONS 


1. We will begin with one of the simplest types of scale, which
will serve to illustrate many of the techniques needed in the more
elaborate tests; namely, scales for assessing the relative preferences
of a group of persons towards a set of issues. For example, Thurstone
(1928), Bogardus (1933) and others have studied nationality
preferences, i.e. the relative popularity of English, Scottish, German,
Turkish, Negro and other peoples among American testees. Many
studies have been made of the relative popularity of school subjects
among pupils, two recent ones carried out in this country being
... and Shakespeare's (1936). Valentine (1934) and
others have investigated the relative importance of the various
motives which influence teachers in their choice of a career. A good
instance in the field of industrial psychology is Wyatt and Langdon's
(1937) recent study of the chief causes of satisfaction or
dissatisfaction among factory workers, and of the types of employment
most popular among women operatives. Newspaper competitions
which determine the most popular film stars or novels, etc. of their
readers, might also be mentioned.

For obtaining these various preferences, four main techniques
may be distinguished: simple voting, rating, ranking, and paired
comparisons.

... let us say, pick out six sources of dissatisfaction
will have a much greater influence on the final result than those
who only choose one source. This difficulty may be overcome
if the former judges' votes are each given one sixth of
the weight of the latters'; and if the votes of other judges who
give other numbers are similarly weighted.

This technique is the simplest from the point of view of the
judge, but the final result is unlikely to be reliable unless a very
large number of judges take part. For example, the relative
importance of various causes of dissatisfaction found among one
hundred operatives might be distinctly different from their relative
importance among another hundred.
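
The weighting of votes described above may be sketched as follows. This is only an illustration under stated assumptions: the function name `weighted_vote_totals` and the sample ballots are hypothetical, and each judge is simply given a total weight of one, shared equally among the items he names.

```python
from collections import defaultdict


def weighted_vote_totals(ballots):
    """Total the votes so that every judge carries the same overall weight.

    `ballots` is a list of lists; each inner list holds the items one judge
    voted for. A judge naming six items gives each of them one sixth of a vote.
    """
    totals = defaultdict(float)
    for chosen in ballots:
        if not chosen:
            continue  # a judge who names nothing contributes nothing
        share = 1.0 / len(chosen)
        for item in chosen:
            totals[item] += share
    return dict(totals)


# Hypothetical ballots: one judge names six sources, one names one, one names two.
ballots = [
    ["noise", "pay", "hours", "heat", "dust", "monotony"],
    ["pay"],
    ["pay", "hours"],
]
totals = weighted_vote_totals(ballots)
```

Because each judge's shares sum to one, the grand total of the weighted votes equals the number of judges who voted.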


(ii) Rating Methods

3. Each judge is instructed to give, say, 4 marks to those items
which he regards as important, 3 to items which are somewhat less
important, and so on down to 0 for items which are to him of no
importance. Eaglesham (1937) used this technique in finding the
opinions of teachers on the relative practicability of a set of
twenty-eight educational ideals. It has often been employed in industrial
psychological investigations, such as rating the relative effectiveness
of advertisements. Alternatively, plus and minus judgments may
be requested, e.g. +2 for strong likes, +1 for moderate likes,
0 for indifference, -2 for strong dislikes; or letters (A to E) may be
substituted. Experiments by Symonds (1924) and others have
shown that raters can fairly readily distinguish five or even seven
different grades. A larger number than seven is unwise, because
raters can hardly discriminate so many steps consistently, unless
they are specially trained. A smaller number (e.g. +, 0, -, or
A, B, C) is rather wasteful of the raters' discriminatory powers, just
as is the voting technique. It therefore leads to decreased reliability
of the ratings.


To combine the results of all the raters and determine the group
preference, the ratings given to each item are summed and divided
by the number of raters. If some of the raters have omitted some
of the items, this averaging procedure will allow for it. Adjustments
may, however, have to be made for two very common errors.
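
The averaging procedure may be sketched as below. The function name `mean_ratings` and the sample marks are hypothetical; the only assumption made is that each item's total is divided by the number of raters who actually marked it, which is how omissions are allowed for.

```python
def mean_ratings(ratings_by_rater):
    """Average the marks given to each item over the raters who rated it.

    `ratings_by_rater` is a list of dicts mapping item -> mark; an item a
    rater omitted is simply absent from that rater's dict.
    """
    sums, counts = {}, {}
    for ratings in ratings_by_rater:
        for item, mark in ratings.items():
            sums[item] = sums.get(item, 0) + mark
            counts[item] = counts.get(item, 0) + 1
    return {item: sums[item] / counts[item] for item in sums}


# Hypothetical marks on a 0 to 4 scale; the second rater omitted item "b".
group = mean_ratings([{"a": 4, "b": 2}, {"a": 3}, {"a": 2, "b": 0}])
```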

4. Errors in ratings and their correction.—First, some raters may
adopt a much higher average rating than others; e.g. A may chiefly
give 4's, 3's and 2's, B chiefly gives 2's, 1's and 0's. Thus, when
Thorndike (1935) asked a number of persons to estimate their
degree of interest in, or liking for, various topics, on a +5 to -5
scale, he found that the average rating used by different judges
ranged from +4.2 to +0.4.

Secondly, some raters may use more extreme ratings, either high
or low, than others. In an experiment where the writer presented
Eaglesham's list of educational ideals to 109 student teachers, he found
that one student used 7 per cent. of 4's and 0's, another 59 per cent.
instead of the desired 40 per cent. The judges were specifically
instructed to employ approximately equal proportions of 4's, 3's,
2's, 1's and 0's; but for this warning, the variation in dispersion of
ratings might have been considerably greater.*

Now the first error is quite immaterial so long as every rater rates 
every item. But if, say, rater A gives high marks to items Nos. 1-23, 
and omits Nos. 24-28, while rater B gives low marks to Nos. 6-28 
and omits Nos. 1-5; then in the final averages, items 1-5 will be 
too high, 24-28 too low. Hence if large errors of this type occur, 
the average of all the ratings given by each rater should be 
determined; these averages should then be made identical by 
applying an appropriate correction, before the ratings for the 
separate items are summed: e.g. if A employs an average rating of 2½
instead of 2, then half a mark may be subtracted from all his judgments.
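
A minimal sketch of this first correction is given below, assuming that every rater's marks are shifted by a constant so that his average equals a common target. The function name `equalize_rater_mean`, the target of 2 and the sample marks are hypothetical.

```python
from statistics import mean


def equalize_rater_mean(ratings, target_mean=2.0):
    """Shift one rater's marks by a constant so that their mean is `target_mean`."""
    marks = list(ratings.values())
    shift = target_mean - mean(marks)
    return {item: mark + shift for item, mark in ratings.items()}


# Hypothetical rater A, whose average is 2.5: half a mark is subtracted throughout.
rater_a = {"item1": 3, "item2": 2, "item3": 3, "item4": 2}
adjusted_a = equalize_rater_mean(rater_a)
```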

The second error is less obvious but much more serious. If 
it is not corrected, rater A (with a large dispersion of ratings) will 
have a greater influence on the final averages than rater B (with a 


small dispersion). The appropriate adjustment is to express each
rating as a "sigma score," i.e. as a deviation from the rater's own
mean, divided by the standard deviation of all his ratings; and only
then to sum the results for the items. This procedure is simple,
but laborious; and it may have very little effect on the final results
if the number of raters is large, and their variations in dispersion
fairly small (cf. Conrad, 1933). The investigator should perhaps
first calculate the reliability of his results (cf. § 15) without making
these corrections, and only if the reliability is low need he attempt
to see whether the application of the corrections will improve it.
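
The sigma-score adjustment may be illustrated as follows. This is a sketch only: the function name `sigma_scores` and the sample marks are hypothetical, and the population standard deviation of each rater's marks is assumed as the divisor.

```python
from statistics import mean, pstdev


def sigma_scores(ratings):
    """Express one rater's marks as deviations from his own mean, in units of
    the standard deviation of all his marks."""
    marks = list(ratings.values())
    centre = mean(marks)
    spread = pstdev(marks)
    if spread == 0:
        raise ValueError("a rater who gives every item the same mark cannot be scaled")
    return {item: (mark - centre) / spread for item, mark in ratings.items()}


# Hypothetical raters: A spreads his marks widely, B keeps to the middle.
rater_a = {"x": 4, "y": 2, "z": 0}
rater_b = {"x": 3, "y": 2, "z": 1}
scores_a = sigma_scores(rater_a)
scores_b = sigma_scores(rater_b)
```

After the adjustment both raters give item x the same score, so that A's wider dispersion no longer gives him the greater influence when the scores are summed.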

A more regular distribution of ratings may be secured if the
raters are instructed as to the proportions of items which they
should assign to each step on the rating scale. It is usually assumed
that the distribution should approximate to a "normal" or binomial
type. For example, if the unit of the scale is taken to be the standard
deviation, approximately 7 per cent. of the items would receive
4 marks, 24 per cent. 3 marks, 38 per cent. 2 marks, 24 per cent.
1 mark, and 7 per cent. 0†. The appropriate proportions for other
ratings with different numbers of steps are tabulated by Symonds (1995)
and Guilford (1936).
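
These proportions can be checked with a short calculation. The sketch below assumes that each step of the scale occupies one standard deviation of a normal curve centred on the middle step, the two end steps taking the tails; the function name `normal_step_proportions` is hypothetical.

```python
from statistics import NormalDist


def normal_step_proportions(steps=5):
    """Proportion of items falling in each step when each step is one
    standard deviation wide and the scale is centred on the mean."""
    curve = NormalDist()
    centre = (steps - 1) / 2
    proportions = []
    for k in range(steps):
        # Cumulative proportion below and above this step; the end steps
        # absorb the whole of the corresponding tail.
        below = 0.0 if k == 0 else curve.cdf(k - 0.5 - centre)
        above = 1.0 if k == steps - 1 else curve.cdf(k + 0.5 - centre)
        proportions.append(above - below)
    return proportions


per_cent = [round(100 * p) for p in normal_step_proportions(5)]
```

For a five-step scale this gives 7, 24, 38, 24 and 7 per cent., in agreement with the figures quoted above.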


* Precisely the same errors occur in the marking of examination questions,
when the marks given by different examiners, or the marks from different
questions have to be combined. Both the average mark, and the dispersion
of marks given by each examiner to each question are liable to scandalous
variations, and should be adjusted (cf. Hartog and Rhodes, 1936).

† When the results of several raters are to be combined, their averaged
ratings will show a much narrower distribution than that of the original
ratings. The writer would suggest, therefore, that raters be asked to use a
distribution with a large standard deviation. For instance, if raters each
adopt an 18, 20, 24, 20, 18 per cent. distribution (S.D. = ...), and if their
ratings yield an average inter-correlation of +0.30, then the averaged
ratings will yield the desired 7, 24, 38, 24, 7 per cent. distribution (S.D. = 1.0).
Again, if the raters are likely to inter-correlate to +0.66 then a 12, 22, 32,
22, 12 per cent. distribution (S.D. = 1.25) should be used. If the average
inter-correlation can be estimated, the appropriate distribution may be readily
calculated from Kelley's formula No. 171 (cf. Kelley, 1923).
5. Obtaining constant standards among raters.—Another source
of inaccuracy in this technique, which cannot easily be corrected,
is that the rater's standards may vary indiscriminately throughout
the rating process; e.g. he may give rather high marks to all the
items at the beginning and later become stricter. (This may also
occur in examination marking; cf. footnote p. 6.) To reduce this
we may adopt either of two procedures which are common in rating
human character, namely the "Man to Man" scale (§ 84) or the
"Graphic" scale (§ 85). The rater should be instructed to look
through the items and pick out five of them which seem to him
representative of the five grades which he intends to employ.
Thereafter he should compare each item with these samples and give it
the same mark as the sample to which it most nearly corresponds.
Alternatively the investigator should supply the rater with a standard
set of samples. This is commonly done in educational scales for
grading the quality of children's handwriting or drawings, etc.
Generally however it is impossible to select a set of samples upon
which all the raters will agree.

6. Another way of reducing the difficulty and obtaining common 
standards among different raters may be illustrated by Bogardus's 
study of nationality preferences (1933). Instead of telling his 
judges to give 7 marks to those nations they liked most, 1 mark to 
those they liked least, or to use some other numerical scale, he 
instructed them to check one of the following statements with 


reference to members of each nation:

Would marry
Would have as regular friends
Would work beside in an office
Would have several families in my neighbourhood
Would have merely as speaking acquaintances
Would have live outside my neighbourhood
Would have live outside my country

The adoption of such a method seems likely to make the ratings more
concrete and more reliable than any numerical scale. Yet the
ratings can of course later be translated into numbers by the
experimenter.
7. The great advantage of the rating technique is its easy
applicability to large numbers of items. The two other techniques
are, for various reasons, more accurate than rating; but they can
only be applied to relatively small numbers of items, No. (iii) to
about 95 or less, No. (iv) to about 12 or less.

(iii) Ranking

Here the judge arranges in order of preference the series of
items (school subjects, advertisements, etc.). It is
desirable to present the items on a set of cards so that they may be
easily manipulated and re-arranged until the judge is satisfied with
the result. To be given a printed list and to write 1, 2, 3, etc., in
order of choice is less satisfactory, not merely because it is more
confusing, but also because there is likely to be a considerable
"space error." Symonds (1936) has shown, both with children
and with adults, that there is a tendency to rank items near the 
head of the list too high, and items near the foot too low. 

When every judge ranks all the items, the combination of the
results from all judges is simple. The average, or better, the median
rank obtained by each item is determined. If desired, these final
averages may be re-ranked.
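
The combination of complete rank orders may be sketched thus. The function name `median_ranks` and the three sample orders are hypothetical; each judge's list is assumed to run from most to least preferred and to contain every item.

```python
from statistics import median


def median_ranks(rank_orders):
    """Return the median rank of each item and the items re-ranked by it.

    `rank_orders` is a list of lists, one per judge, each giving all the
    items from most preferred (rank 1) to least preferred.
    """
    positions = {}
    for order in rank_orders:
        for rank, item in enumerate(order, start=1):
            positions.setdefault(item, []).append(rank)
    medians = {item: median(ranks) for item, ranks in positions.items()}
    final_order = sorted(medians, key=medians.get)
    return medians, final_order


# Hypothetical rank orders from three judges.
medians, final_order = median_ranks([
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["a", "c", "b"],
])
```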

This procedure does not, like rating, involve giving absolute 
marks to any of the items, but depends only on comparison of their 
relative desirability. Hence the two common errors in ratings
(§ 4) are avoided; both the mean and standard deviation of each
judge's rankings are identical. Moreover, the judges' standards
are not likely to vary during the process of ranking because they 
are, or should be, comparing every item with every other before 
deciding on their positions. 

8. Distribution of rank orders.—We should note, however, that a
rank order assumes a highly artificial distribution of items. With,
say, 20 items, it conveys the impression that the difference between
items 1 and 2, or 19 and 20, is the same size as the difference between
Nos. 10 and 11; whereas it is highly probable that the latter items
are closer together than the former. This is borne out by the
difficulty which judges find in ranking the middle items to their
satisfaction; they can usually differentiate much more readily at
the extremes. Although we certainly cannot claim that all preferences
and attitudes are distributed in the total population according
to the normal curve (in fact Thurstone and Chave (1929) strongly
criticize such an assumption), yet a normal distribution is generally
more justifiable than the rectangular distribution yielded by ranks.
For example, a typical group of school children might be very fond
of a few school subjects, and very much dislike a few subjects
but be indifferent or undecided about the majority of subjects;
and a group of operatives might strongly object to certain factory
conditions, greatly appreciate others, but possess no attitude either
way towards a large number of conditions. In both these instances
then, the distribution should approximate to normality. Hence
there is much to be said for translating the results of a ranking study
into sigma units, which do assume a normal distribution. The
procedure may be facilitated by Symonds's (1931) or Hull's (1928)
tables. With 20 items, the sigma scores of items ranked
1st, 2nd, 5th, 10th, 11th, 16th and 20th would be:

+2.06  +1.45  +0.76  +0.06  -0.06  -0.76  -2.06

It will be seen that these scores do allow for the greater difference
between 1st and 2nd than between 10th and 11th places.
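
The sigma scores quoted for 20 items can be reproduced by a short calculation. The sketch below rests on an assumption not stated in the text, namely that each rank is given the mean normal deviate of the slice of the normal curve (one n-th of its area) which that rank occupies; this assumption does yield the published figures. The function name `rank_sigma_score` is hypothetical.

```python
from statistics import NormalDist


def rank_sigma_score(rank, n_items):
    """Mean normal deviate of the slice of the curve occupied by one rank."""
    if not 1 <= rank <= n_items:
        raise ValueError("rank must lie between 1 and n_items")
    curve = NormalDist()
    upper = 1 - (rank - 1) / n_items  # cumulative proportion at the top of the slice
    lower = 1 - rank / n_items        # cumulative proportion at the bottom
    # Ordinate of the curve at each boundary (zero in the infinite tails).
    y_upper = 0.0 if upper >= 1 else curve.pdf(curve.inv_cdf(upper))
    y_lower = 0.0 if lower <= 0 else curve.pdf(curve.inv_cdf(lower))
    # Mean deviate of a slice = difference of ordinates / area of the slice.
    return (y_lower - y_upper) * n_items


scores = {r: round(rank_sigma_score(r, 20), 2) for r in (1, 2, 5, 10, 11, 16, 20)}
```

The step from 1st to 2nd place is then about 0.6 sigma, while that from 10th to 11th is only about 0.12 sigma.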

9. Combining incomplete sets of ranks.—Sometimes it is impossible
to get complete lists from all the judges; they may rank only partial,
but overlapping, sets of items: e.g. in Pritchard's (1935) investigation
the school children from different schools dealt with different
lists of school subjects. In such cases it is usually adequate to sum
the total ranks received by each item and divide by the number 
of judges who ranked it. Alternatively, Pritchard’s method of 
averaging the deviations of each rank from the mean rank of its 
series may be fairer. Garrett (1924) has examined several other 
more exact methods, but fails to find any superiority in their results. 
For instance, the list supplied by each judge may be turned into 
percentile ranks, and the various percentiles obtained by an item 
then averaged ; or the lists may be turned into sigma scores before 
averaging. 
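
The simplest of these procedures, summing the ranks received by each item and dividing by the number of judges who ranked it, may be sketched as follows. The function name `average_partial_ranks` and the two sample lists are hypothetical.

```python
def average_partial_ranks(rank_orders):
    """Average rank of each item over only those judges who ranked it.

    `rank_orders` is a list of lists; each inner list is one judge's partial
    order, from most preferred (rank 1) downwards.
    """
    totals, counts = {}, {}
    for order in rank_orders:
        for rank, item in enumerate(order, start=1):
            totals[item] = totals.get(item, 0) + rank
            counts[item] = counts.get(item, 0) + 1
    return {item: totals[item] / counts[item] for item in totals}


# Hypothetical lists from two schools which offered different subjects.
averages = average_partial_ranks([
    ["maths", "english", "art"],
    ["english", "history"],
])
```

Since the lists differ in length, a rank of 2 does not mean the same thing in each; this is the reason for the alternatives mentioned above, in which ranks are first turned into deviations, percentiles or sigma scores.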
(iv) Paired Comparisons 

10. Here each judge compares each item directly with every 
other and records his preference. The pairs of items should be 
presented in random order. Thus if items а, b, c, d and e are to be 
judged, then the order might be ac, db, ec, ba, de, cb, ad, be, cd, ea.
This technique was used by Thurstone (1928) in studying nationality
preferences, by Wyatt (1937) in determining the main factors that
influence satisfaction with work, and in innumerable other researches.
If there are n items, the total judgments required from each judge
= $n(n-1)/2$. Hence the task becomes very laborious when n is
large, and it does not, apparently, produce any better results than
does the ranking technique (cf. Guilford, 1936; Saffir, 1937).
It can however be extended to larger numbers if the items are first
roughly rated by the judges, and comparison of pairs is then applied
to a few items at a time which are known (from the ratings) to be
close together. (Thurstone (1930) has used the same device for 
extending the ranking method to large numbers of items.) 

The final order of preference is obtained from the total number 
of times each item is preferred. 
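
The procedure may be sketched as below: every pair is presented once, in random order and with random left-right position, and the final order is obtained from the number of times each item is preferred. The function names `comparison_pairs` and `preference_order`, the seed and the sample choices are all hypothetical.

```python
import random
from itertools import combinations


def comparison_pairs(items, seed=None):
    """All n(n-1)/2 pairs of items, shuffled in order and in position."""
    rng = random.Random(seed)
    pairs = [list(pair) for pair in combinations(items, 2)]
    for pair in pairs:
        rng.shuffle(pair)  # vary which member of the pair is shown first
    rng.shuffle(pairs)     # present the pairs in random order
    return [tuple(pair) for pair in pairs]


def preference_order(items, preferred):
    """Order the items by the number of times each was preferred."""
    wins = {item: 0 for item in items}
    for item in preferred:
        wins[item] += 1
    return sorted(items, key=lambda item: -wins[item])


items = ["a", "b", "c", "d", "e"]
pairs = comparison_pairs(items, seed=1)
# Hypothetical judge who always prefers the alphabetically earlier item.
order = preference_order(items, [min(pair) for pair in pairs])
```

With five items there are ten pairs, as in the example above; with twelve items there are already sixty-six.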


JUDGMENTS IN EQUIVALENT UNITS
11. All these techniques yield certain numerical values which 
represent the group's relative preferences for the various items. 
But such numbers suggest a spurious degree of accuracy. If, for
instance, items a, b, c, and d happen to have received average ratings
or ranks of 3, 2½, 2 and 1½, we might think that a is as much preferred
to b as b is to c, and that a is twice as popular as d. But such a
... would be no more legitimate than the conclusion that a
... which costs 3d. is twice as nice as one that costs 1½d. If our
units were inches on a ruler or pounds weight, then the units
would be equal and the conclusion would be true; but we have
as yet no knowledge of the size or the equivalency of our psychological
units. Thus, although Wyatt's research shows that "Security
of employment" is the chief factor in job satisfaction for the average
operative, we are not entitled to deduce from his figures how
much more important it is than other factors. Another defect
of our present scales may be illustrated by the ranking of, say,
ten pictures in order of preference; a certain average rank for each
picture is obtained. Now suppose that an eleventh picture is added 
to the series and that they are all re-ranked, we shall indubitably 
find that all the previous ten averages have been altered slightly 
by this introduction of a new picture. If, however, our figures had 
consisted of truly equivalent units (e.g. if we had simply measured 
the linear dimensions of the pictures) then such figures would not 
be altered by inserting additional pictures into the series. 

12. Comparisons of attitudes of groups.—One of the main uses 
of the techniques, which has not been mentioned so far, is the 
comparison of the opinions of different groups of judges. Shakes- 
peare (1936) compared the school interests of children of different 
ages ; Eaglesham (1937) contrasted the educational ideals of Scottish 
and English teachers ; Wyatt studied work attitudes in different 
factories where different conditions prevailed. Wickman (1928), 
Stogdill (1934) and others obtained ratings from teachers, parents,
and psychiatrists as to the relative seriousness or harmfulness to
children’s mental health of a number of bad habits ; the differences 
between their lists were very illuminating. Innumerable other such 
investigations might be cited. Now in none of the British experi- 
ments was it known precisely how significant were the differences 
between the groups, whether they might not be due to chance 
factors, inherent in the unreliability of the scales of preferences.
The exact amount and the statistical significance of such differences 
can, however, be readily determined nowadays if the preferences 
are properly scaled. 

13. Scaling of paired comparisons or ranked data.—The techniques
used for scaling have been developed chiefly by Thurstone* on the
basis of the psychophysical techniques used in grading sensation
intensities and the like. He holds that, in so far as it is possible
to assess degrees of brightness or loudness, and to say that the
difference between the intensities of stimuli a and b is equal to the
difference between c and d, to the same extent it is possible to
assess degrees of popularity. For a full account of the theory and
the application of the techniques the reader should consult Thurstone's
articles (1927 ab, 1928, 1930, 1931a), or Guilford's book (1936);
we will only attempt here a somewhat schematized outline of the
method.

The principle employed is that equally often noticed differences
are equal. That is, if the probability of a being judged higher in the scale
than b is the same as the probability that c is judged higher than d,
then the differences are psychologically equal. And the unit in terms
of which these differences are expressed is the standard deviation of
the distribution of preferences.

* McK. Cattell, Thorndike and others had previously adapted psycho-
physical methods to educational scales and other psychological variables,
e.g. for the purpose of grading a set of samples of handwriting according to
merit in terms of equivalent units.

Knowing from our experimental data what is the probability of a being
preferred to b, we can read off from the Kelley-Wood Normal Curve Tables
the corresponding value of $x/\Sigma$; $x$ then represents the true amount of the
difference in rational units between a and b. We can perhaps clarify this
principle, and show how $\Sigma$ is obtained, by illustrating its application to the
scaling of rank orders. Let N persons arrange n items, $S_a, S_b, \ldots, S_n$, in
order, assigning to them ranks $R_1, R_2, R_3, \ldots, R_n$. Now if a considerable
number of judges place $S_a$ at, say, $R_{10}$, almost as many are likely to place it
at $R_9$ and $R_{11}$; rather fewer will be likely to place it at $R_8$ or $R_{12}$, and very
few are likely to put it much higher or much lower than this. The distribution of
judgments tends then to conform to a normal distribution curve, centred
around the mean value of the R's assigned to this S. Similarly the R's
given to $S_b$ will tend to be distributed normally around some other R, say
$R_{14}$. These two distributions possess standard deviations $\sigma_a$ and $\sigma_b$. Now it
will be seen that the differences in rank positions given to $S_a$ and $S_b$ by different
persons will also tend to vary. The commonest difference will be $R_{14} - R_{10}$
= 4 ranks, but among a few judges $S_a$ and $S_b$ may be as much as 10 ranks
apart, and in a few $S_b$ may be ranked higher than $S_a$. These differences then
will also tend to be distributed normally, and it is this distribution of prefer-
ences which possesses the standard deviation $\Sigma$.


Now $\Sigma = \sqrt{\sigma_a^2 + \sigma_b^2 - 2r\sigma_a\sigma_b}$. Hence the required difference between
$S_a$ and $S_b$ is $x = \frac{x}{\Sigma}\sqrt{\sigma_a^2 + \sigma_b^2 - 2r\sigma_a\sigma_b}$. $S_b - S_c$, or $S_c - S_d$, etc.
may be similarly derived. In actual practice it is usual to determine the
scale separations only for adjacent pairs of items, i.e. between pairs which
are next to one another in average rank, since the greatest reliability is thereby
obtained. We are justified in assuming no correlation ($r = 0$), and
for most purposes it is adequate to regard $\sigma_a$, $\sigma_b$, $\sigma_c$,
etc., as identical and equal to unity. In that case the calculations reduce to
the simple quantity, $S_a - S_b = \frac{x}{\Sigma} \times \sqrt{2}$. For the scaling of ratings a some-
what different method is employed, which is described in §§ 45-48.

We will give one illustration of

Thurstone's results, namely, the scaling of criminals according to
the seriousness of their crimes (1931a). Gangsters and kidnappers
[...] be regarded as worse than bootleggers, who would
[...] pickpockets, who would be worse than tramps. By
[...] paired comparisons of 13 types of criminals from 240
[...] it was possible to quantify these differences. On a
[...] gamblers at 1.5, pickpockets at 1.9, bootleggers and
[...] 2.6, gangsters and kidnappers at 3.6.
[...] is made possible by means of such
[...] the measurement of the effects of various types of
propaganda. [...] attended a
motion picture film "Street of Chance," which
[...] of the [...] unfavourable light. They then sorted the
[...] new scale gamblers had gone up from
1.5 to 2.1 (a 20 [...]) whilst the other criminals remained
[...] positions. Similar experiments have been
performed on [...]
to prohibition, etc. In the majority of these the propagandist film
was found to have had positive effects, which sometimes lasted for
a considerable period (cf. Peterson and Thurstone (1933)).
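
The simplified scaling procedure of § 13 can be sketched in a few lines of Python. This is only an illustration under the assumptions stated there (standard deviations taken as unity and no correlation, unless other values are supplied); the function names `scale_separation` and `scale_positions` are our own, and the standard library's normal distribution stands in for the Kelley-Wood tables.

```python
from math import sqrt
from statistics import NormalDist


def scale_separation(p, sd_a=1.0, sd_b=1.0, r=0.0):
    """Scale difference between items a and b.

    p is the observed proportion of judges who prefer a to b.
    sd_a, sd_b and r default to the simplifying assumptions
    (unit standard deviations, no correlation).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    # Normal deviate corresponding to p (the table look-up, x / Sigma).
    deviate = NormalDist().inv_cdf(p)
    # Standard deviation of the distribution of differences (Sigma).
    big_sigma = sqrt(sd_a ** 2 + sd_b ** 2 - 2.0 * r * sd_a * sd_b)
    return deviate * big_sigma


def scale_positions(adjacent_proportions):
    """Cumulate separations of adjacent items into scale positions.

    adjacent_proportions[i] is the proportion preferring item i + 1
    to item i, the items being taken in order of average rank.
    The first item is arbitrarily placed at zero.
    """
    positions = [0.0]
    for p in adjacent_proportions:
        positions.append(positions[-1] + scale_separation(p))
    return positions
```

A proportion of one half gives no separation at all, and proportions of 0.3 and 0.7 give separations equal in size but opposite in sign, which is the sense in which equally often noticed differences are treated as equal.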


C. THE RELIABILITY OF GROUP ATTITUDE SURVEYS 


15. Calculation of reliability.—Reliability may either mean the
extent to which results remain unaltered if the testees or judges
repeat their judgments after an interval (it being assumed that no
propaganda or other influence has operated during the interval to
affect their opinions); or it may mean the extent to which the results
resemble the results obtained from another, similar, group of testees
or judges. It is better to denote these as the repeat reliability and
the consistency, respectively. With large numbers of items, repeat
reliability is readily determined by inter-correlating the two sets of
results; with smaller numbers the average amount of change that
has taken place in the second trial should be examined. For the
determination of consistency the most suitable technique is to
calculate the average inter-correlation between the judgments of the
testees, by means of formulae which Kelley (1923) provides (No. 171
for ratings, No. 172 for ranks). If we have N judges, and the average
correlation between any pair of them is r, then we can predict by
the Spearman-Brown formula* that the correlation between the total
results and the total results of another group of N judges will be

$$\frac{Nr}{1 + (N - 1)r}.$$

This formula has another use. Supposing we find the reliability
of the results to be inadequate, say, a coefficient of +0.60, then we
can predict from it that we should multiply the size of our group by
six in order to get a satisfactory figure of 0.90. Generally speaking
it is desirable to have at least a hundred judges in order to get a
reliable set of ratings, rankings or paired comparisons.
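
The two uses of the Spearman-Brown formula described in § 15 can be checked with a short Python sketch. The function names `spearman_brown` and `group_multiplier` are our own; the second is simply the same formula solved for the multiplying factor, and the figures are those quoted in the text.

```python
def spearman_brown(r, n):
    """Predicted reliability of the pooled results of n judges,
    given an average inter-judge correlation r."""
    return n * r / (1.0 + (n - 1.0) * r)


def group_multiplier(current, target):
    """Factor by which the group must be multiplied to raise the
    reliability of its pooled results from current to target."""
    if not (0.0 < current < 1.0 and 0.0 < target < 1.0):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - current) / (current * (1.0 - target))


# The worked figure from the text: raising 0.60 to 0.90.
factor = group_multiplier(0.60, 0.90)
print(round(factor, 2))                         # 6.0
print(round(spearman_brown(0.60, factor), 2))   # 0.9
```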
* The legitimacy of applying the Spearman-Brown formula to attitude
judgments may be questioned (cf. Guilford (1936)). [...] (1931),
however, shows that its estimates, when applied to ratings of character
traits, are fairly accurate, though not so accurate as with educational
tests; and Rosander (1936) finds that increasing the number of sorters in
scaling a Thurstone-type attitude test (cf. §§ 46-48) increases the
reliability of the sortings to the extent predicted by the formula. We
therefore seem justified in using it in the present instance.

16. Factors influencing reliability.—Reliability (consistency)
depends not only on the size of the group of judges, but also on their
homogeneity, and on the clarity or unambiguousness of the items.
It will be reduced if the judges are very diverse in their opinions,
and will be low if they find much difficulty in understanding the
items, or are for other reasons uncertain about their preferences.
Simple judgments of relative popularity are likely also to be more
reliable than judgments of equivocal qualities such as the "effective-
ness" of advertisements or the "practicability" of educational
ideals, etc., unless these are thoroughly and concretely defined. If
the quality to be assessed is composite, it is better to split it into a
series of components which may be rated separately and then later
combined (cf. analytic rating scales, § 90). For instance, instead of
merely trying to mark the "goodness" of a set of children's English
compositions, it may be preferable to mark them separately for:

    General impression
    Mechanics (punctuation, spelling, grammar, etc.)
    Content (range and appropriateness of ideas)
    Style (expression of these ideas).

These four qualities may then be weighted according to their relative
importance before they are recombined.*


[...] reliability. It is much easier for judges to decide on their preferences
when some of the [...]
this may sometimes be interpreted as stereotyped prejudice in
the judges. Thurstone (1928) points out that in a nationality
preference test, scaled in equivalent units, the width of the scale or
the extent of separation between the various nations (which is
[...]) gives a [...]

Much the same point is brought out by [...] Chant and [...]
Thouless's studies of the distributions of responses to single items.
Allport (1934) showed that when there is pressure in a social group
towards conformity to a certain opinion or a certain course of action,
then the distribution of opinions and actions tends to be, not the
normal curve, but a J-curve. For instance, the strength of
opposition to contraceptive methods among Roman Catholics would
yield such a distribution. And Thouless (1935) found a U-distribution
of opinions on certain religious and other controversial issues.
When a group of students was instructed to
scale their degree of belief or disbelief in such statements
as "[...] such spiritual beings as angels," there were far
more +3 and -3 judgments than [...], which would indicate
uncertainty, or a tolerant, balanced opinion. On the other
hand [...] from any pressure towards conformity (e.g.
"Tigers [...] found in certain parts of China") yielded an excess of
intermediate judgments.
Clearly these social forces or biases which influence responses to
single items will also bring about high reliability of discrimination
within a series of items.

* In applying this analytic procedure for the purpose of improving
reliability, we are almost certain to find what is described below (§ 108)
as the halo effect; all the separate judgments will be much affected by the
judges' general like or dislike for the compositions. Hence it would be
[...] qualities at their [...] value. But the effect is immaterial
when they are going to be recombined.


D. THE VALIDITY OF GROUP ATTITUDE SURVEYS 

18. By the validity of a test of, say, mechanical aptitude, we
mean the extent to which the results of the test actually predict
the testees' ability in mechanical work. No such objective criterion
of validity is available in a test of character traits or affective
attitudes. But we still wish to know what the results of the test or
scale really show. Can we, for instance, accept Eaglesham's
conclusion that Scottish teachers are less progressive than English, on the
basis of their ratings of educational ideals; or Shakespeare's
conclusion that children chiefly enjoy school subjects which involve
activity or which embody concrete human interests; or that (from
a newspaper competition of some years back) Fildes's "The Doctor"
is the most popular picture known to the British public?

It seems probable that the validity of the type of scales which 
has concerned us so far is on the whole better than that of any of the 
scales or tests that we are to consider later. There are indeed 
weaknesses, but these are likely to be intensified in the instruments
discussed below. A number of flaws which tend to influence the
validity may be pointed out.

19. Representativeness of sampling.—In the first place it is very
unsafe to make deductions about other groups, or about people
in general, from results obtained with a particular group of judges
(cf. Vernon (1938)). The popularity of "The Doctor" is indeed
assured among those who entered that competition, but does not
necessarily hold among all readers of that newspaper, and is most
unlikely to apply among readers of The Connoisseur. Similarly,
the Literary Digest straw vote before the 1936 U.S. presidential
election did not validly predict the election result, though it did
prove that the Literary Digest's method of sampling the population
was totally inadequate (cf. Robinson (1937)). Clearly then it is
essential to study the constituent members of the group thoroughly
before applying their results to other groups. A complete study is
unnecessary; only those characteristics which are likely to be
correlated with the attitude that is being measured need to be controlled.
For example, the churches to which members of the group belong
need hardly be taken into account in studies of attitudes to factory
work, but might have an effect on their nationality preferences.
Socio-economic level, intelligence, sex, and possibly age are likely
to correlate with almost all the attitudes we have mentioned, so
it is generally desirable to assess these (a fairly easy matter) and
take them into consideration before drawing any wide conclusions
from an attitude survey.

20. Relation between verbal opinions and behaviour.—Caution
is essential before assuming that the results of attitude surveys
will be predictive of actual behaviour. We would refer the reader
here to G. Allport's discussion of the theory of attitudes (1935),
where he shows that attitudes should not be defined either as types
of overt conduct or merely as verbally expressed opinions or
beliefs, but rather as directing or motivating tendencies which lie
behind both behaviour and opinion. Thurstone and Chave (1929)
also consider that actions and verbal opinions may be equally
fallible as manifestations of an attitude. Any piece of behaviour,
or any expression of opinion, has so many determinants that tests
and questionnaires can hardly be expected to cover them all. As
Katz and Allport (1931) point out, our publicly exhibited attitudes
(which is what the tests chiefly tap), our deeper and more private
motives (which may sometimes be reached under favourable
conditions), and our actions, are inter-connected in so complex an
organization that many discrepancies between them may be apparent.
For example, some persons may put Americans and French near the
top of their lists of nationality preferences, and yet behave in a far
from friendly manner towards particular American or French
individuals. Again, many professed pacifists will doubtless enlist
when war is declared. But these instances merely show that
conflicting sentiments may co-exist, and that some attitudes may be flexible
and open to suggestion. Thus, it would be unfair to criticize attitude
surveys because they are incapable of revealing every side of human
nature. Much more detailed study of the individual or group is
necessary before we can predict accurately how he or they will feel
and act in any specific concrete situation.

21. Dependence of validity on the judges' co-operation.—Next,
the Einstellung, or attitude of the judges to the investigator and
the investigation, is of prime importance. The great majority of
people are quite incapable of understanding the object of a
psychological or sociological experiment, and are only too likely to be
suspicious, and therefore to produce only what they consider to be
conventionally acceptable opinions, or to try to give the judgments
which they think the investigator expects of them, or to refuse
outright. Allport (1937) and Stagner (1933) point out that most of
[...] attitude studies have been carried out on University
[...] co-operation is readily obtained; applied to such
groups as [...] farmers, the same scales may be entirely
useless. The extent to which the results are influenced by these
factors will of course depend largely on the type of information
that is asked for. There will seldom be any embarrassment over
producing opinions on film stars or advertisements, but there may
be a good deal of hesitation over revealing attitudes to employment
conditions, for fear lest they be passed on to the employers.
Assurances of anonymity certainly help, and yet may fail to remove
all the inhibitions against making one's private attitudes public.
On the other hand, it may happen that the judges feel more secure
when given a [...] sheet or questionnaire to fill up than they do
when asked the same questions in person, since they may be more
doubtful of the anonymity of the latter.

An introductory talk to the group with the purpose of securing
good co-operation and frankness is obviously desirable. Perhaps
even better was Wyatt's method, where the judgments were obtained
during an informal interview with each individual. For co-operation
is essentially an individual affair; different people need to be appealed
to in different ways. A further disadvantage of applying the
questionnaire to a group simultaneously is that neighbours are
likely to be able to see one another's papers, and so some may
be deterred from putting down unconventional opinions which
their associates might ridicule. The writer has found this to be
particularly prevalent among school children who are usually so
crowded together that they can easily overlook one another's test
blanks.

22. Mutual cancellation of variable errors.—We would repeat,
however, that these methods are most often applied to issues about
which there are few inhibitions, so that the validity of the results is
much less affected than in some of the later tests. It should be
remembered also that, as we are interested only in group results,
a number of individual variations in attitude to the investigation
will tend to cancel one another out. For instance, those judges
who treat the whole thing flippantly may be counter-balanced by
those who are over-conscientious. It is only the "constant errors,"
which are in the same direction in the majority of members of the
group, that affect the validity; other, variable, errors merely
decrease the reliability. The same is true of errors that may be
introduced by variations in the judges' temporary moods or by
their recent chance experiences. When the writer asked his students
for criticisms of a number of questionnaires and tests which they
were answering, they frequently raised this point, asserting that
they might answer differently on another day when they were feeling
happy instead of depressed, or vice versa, or when they might perhaps
have read something in a newspaper which influenced their judgments.
Only if a large proportion was similarly influenced would it upset the
group results.
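
The distinction drawn in § 22 between constant and variable errors can be illustrated by a small simulation. This is a hedged sketch, not a model of any actual survey: the function `group_mean`, the normal distribution of the variable errors, and all the numerical values are assumptions chosen purely for illustration.

```python
import random
from statistics import mean


def group_mean(true_value, n_judges, constant_error=0.0,
               variable_sd=1.0, seed=0):
    """Average judgment of a group of simulated judges.

    Each judgment is the true value, plus a constant error shared
    by every judge, plus a variable error peculiar to that judge
    (assumed here to be normally distributed about zero).
    """
    rng = random.Random(seed)
    judgments = [
        true_value + constant_error + rng.gauss(0.0, variable_sd)
        for _ in range(n_judges)
    ]
    return mean(judgments)


# Variable errors alone: the group mean stays close to the true value.
unbiased = group_mean(2.0, 2000)

# A constant error of half a unit is carried straight into the mean.
biased = group_mean(2.0, 2000, constant_error=0.5)
```

With a large group the variable errors largely cancel, so `unbiased` lies near the true value of 2.0, whereas `biased` is displaced by the full half unit however many judges are added.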


23. Dependence of validity on judges' interpretation of the test.—
Another important weakness in attitude surveys is our uncertainty
as to the way in which the judges have interpreted the instructions
and the test material. Although the objective conditions of testing
may be the same for all, yet the same words may often have very
different meanings for [...] especially when the words
[...] from the meaning which [...]
Here the judges [...]
requested to rate a series of educational
aims (e.g. "To develop in the child the capacity for employing
leisure in fitting pursuits") according to their practicability, i.e.
the extent to which they would follow each aim in teaching 13-
year-old elementary school children. Not only are the items
ambiguous, but also the judgment of practicability is far more
confusing than a judgment of like-dislike. Here then the validity
of the results may be seriously questioned; for it is conceivable that
the Scottish teachers (who were found to be less progressive in their
ideals than the English) were interpreting the instructions more
literally and were actually rating the aims by the extent to which
they did apply them in practice, whereas the English may, wittingly
or unwittingly, have rated them more on the basis of desirability.
Probably Eaglesham's conclusion is valid, since further lines of
evidence point in the same direction, but we are not entitled to
assume it from the original questionnaire results alone. As a general
rule it is highly desirable to supplement the findings of any attitude
survey with other data, if possible of a more objective nature.
Although, as shown above (§ 20), pieces of information about the
overt behaviour of the judges are not in themselves adequate criteria
of the judges' psychological attitudes, yet the validity of a subjective
scale will be greatly enhanced if the available factual data do
apparently fit in. To take another instance: a scale of popularity
of film stars is unlikely to be much affected by errors of equivocality,
and yet its validity will be improved if supplemented by box office
data as to the drawing capacities of these stars.
24. With certain types of items it may be far from easy for the
judges to express their real responses in terms of simple judgments
of preference. They may wish to say: "I prefer A to B in some
ways, and B to A in other ways," or, "It all depends on such and
such circumstances." These difficulties reduce the reliability and
make it especially necessary to find out what the judges did mean
by their preferences. Particular caution is needed in the
interpretation of omitted items and judgments of equal, or neutral (e.g.
ratings of zero on a +3 to -3 scale). They may mean that the
judge has never thought about that issue and has no opinion either
way, or that he is unable to make up his mind in view of strong
conflicting inclinations.
Some writers (e.g. Symonds (1931)) imply that such critical
reactions and carefully thought out responses are unnecessary and
undesirable, claiming that better results accrue when immediate
affective impressions are given.* It may be that trained
psychologists can fairly readily adopt such a naive attitude while taking a
test; but they fail to realize that it does not [...] members
of other professions. In the writer's experience [...]
very frequently raised by intelligent [...]
who have developed cautious and critical habits of thinking, and
dislike forcing their complex attitudes into the [...]
which the experimenter provides. Other persons, such as [...]
freshmen and secondary school pupils, although high in
intelligence [...]
[...]cated, and so do not seem to find the same
[...] however, be too naive. Pre-adolescent [...]
questioning may often provoke responses which would not normally
[...] it has often been stated that [...]
preferences which are entirely
outside the range of their actual capacities. But a research in
[...] Cambridge Psychological Laboratory indicates that
[...] much exaggerated as a result of some of the
tests employed; that many boys may never actually have wanted to
be "Architects, astronomers, consuls, surgeons, etc." but that
given a list on which they are asked to express their likes or dislikes
for such occupations, they naturally put down "like."

* The writer has been unable to discover in the literature any definite
proof of this rather important principle. Some confirmatory evidence comes
from Estes's (1937) study of the ability of various groups to judge personality.
[...] artistically inclined persons to be greatly superior to [...]
and other University teachers, and the former claimed that they used
intuitive, impressionistic methods of judging, whereas the latter were more
analytic and intellectual. Further work, both in relation to attitudes and
to judgments of character, is much needed.

26. Precautions in devising and applying tests.—Clearly then the
investigator who compiles the original instructions and test items
needs considerable insight into the probable reactions of his judges,
in order that he may get from them the most valid and reliable
responses. Since he is himself quite likely to possess an affective
attitude towards the issue in question, he needs to make sure that
opinions contrary to his own are properly represented in the [...]
given. Suppose, for example, an industrial employer had wished
to carry out an investigation of job satisfaction [...]
he might easily draw up a list of sources [...]
[...]faction which to him seemed comprehensive, but which might omit
several items that were of importance to his employees. Wyatt
was able to avoid this pitfall since he had already made a
qualitative study of the main factors involved, and [...]
likely to be biased than our hypothetical employer.
Valentine (1934), before drawing up a list of reasons for entering
the teaching career, discussed the topic with a number of students
and got them to suggest the main reasons. In general, therefore, the
investigator should himself be familiar with all sides of the issue to
be surveyed, and should if possible consult others who can provide
him with fresh viewpoints on this issue. He should then try out a
preliminary form of his scale with a small group, representative of
the group whose opinions he wishes to survey, and ask for criticisms
or difficulties or further suggestions, and only then begin the mass
investigation.

27. Finally we would recommend the adoption of a hint from
experimental psychology. In the typical laboratory experiment,
the conditions are standardized, and conclusions are drawn from
the observations by rigorous logical or mathematical procedure.
But at the same time full introspections are obtained from the
subjects in order to ensure that the meaning to them of the various
conditions was also standardized, and to aid in the interpretation of
the results. This same step is desirable in most attitude surveys;
either the judgments should be discussed at an interview between the
investigator and each judge, or, if time does not permit this, space
should be left on the test blanks for spontaneous comments and for
reasons why such and such items were preferred. The investigator
cannot of course apply mathematico-logical processes in generalizing
from such additional data; inevitably he selects and interprets,
and so lays himself open to the charge of bias, though he can reduce
this by including liberal quotations from the comments in his
published results. Yet such comments help immensely to confirm or
contradict the validity of the conclusions which he draws from
the purely quantitative data. Good instances are provided by
Wyatt's research on job satisfaction and Pritchard's on school
subjects. Sometimes indeed the most interesting and
psychologically important findings arise from this material, rather than from
the attitude test itself.

So far we have dealt only with tests or scales for sampling
group opinions. All the following tests can be similarly employed;
but they are also supposed to be capable of measuring an individual's
attitude, and therefore involve certain fresh considerations.


III.—TESTS AND SCALES FOR MEASURING ATTITUDES OF
INDIVIDUALS


A. DESCRIPTION OF TESTS

The many sources of unreliability and the errors among
individuals which, as we have seen, tend to cancel one another out
in group results, make it desirable to adopt more elaborate methods
in [...] individuals. The responses of an individual to a test of
nationality preferences might indeed suggest [...]
generally unfavourable to Fascist nations; or his choices in a
questionnaire on "your favourite novel" might indicate his aesthetic
taste. But such deductions would be extremely unreliable, and
they would entirely fail to measure his attitude, to show how
unfavourable he was to Fascism, or how strong was his aesthetic
sentiment. It might be thought that in order to determine a man's
attitude to art, religion, etc. or his general sentiments such as
pacifism, nationalism, tolerance, etc., we could simply ask him
straight out. But again we should not be able to grade his answers
quantitatively, to say how he compared in these respects with people
in general. Moreover he would quite likely have only very vague
notions as to what such a general attitude entails, and might interpret
the question differently from our expectation; hence the significance
of his responses would be dubious. And direct questioning is
obviously undesirable because he would be liable to produce merely
conventional opinions on politics, morals, etc., which he regarded
as appropriate to the particular social situation. Apart from
conscious falsification, he might not know himself sufficiently well to
be able to deliver a true judgment as to whether he was tolerant or
prejudiced, artistic or Philistine, and so on.
29. Tests of radicalism-conservatism.—In general therefore an
attitude test consists of a long series of questions, or of statements
with which the testee is invited to agree or disagree; and from the
general trend of his responses, the amount or strength of his attitude
may be deduced in quantitative terms. For instance, a so-called
'opinionaire' for measuring radicalism or conservatism might
contain 20 to 50 statements, of which the following (from Lentz's
C-R Opinionaire (1934)) are typical.

    The metric system of weights and measures should be adopted instead of
    our present system.
    Even in an ideal world there should be protective tariffs.
    Conscience is an infallible guide.
    Armistice Day should be celebrated with less martial spirit.

After each statement is printed Yes, No; or Yes, ?, No (? meaning
Doubtful); or True, False; or [...], and one of these
responses is to be underlined or encircled. Alternatively the extent
of agreement may be denoted by ratings, e.g. [...]
-2; or the 'multiple choice' technique may be [...]
several possible responses are provided, one of which has to be
checked. The following example is abbreviated [...]
questionnaire (1930).

    What are your views on HEREDITARY WEALTH ?
    (1) All wealth should revert to the State at death.
    (2) Taxes should confiscate the bulk, leaving only enough for support of
        dependent women and children.
    (3) Inheritance taxes should be [...] rapidly [...] sliding [...]
        about [...] per cent. for large inheritances.
    (4) Very large fortunes should pay a [...]
        so high as to become confiscatory.
    (5) Individual thrift and ini[...]
        inheritance taxation.

Such statements are relatively [...] specific, so that the
[...] general attitude is likely [...] expressed in [...] responses
he chooses; and a basis for grading the amount of his attitude is
provided by the proportion of his responses which point in the same
direction, or by the strength of these responses.

30. The Allport-Vernon "Study of Values."—This test (1931)
gives another illustration of the method. It is designed to measure
an individual's relative standing on six main types of values or
general interests: theoretical or scientific, economic or utilitarian,
aesthetic or artistic, social or humanitarian, political or
power-seeking, and religious or spiritual. The testee is required to rank
his order of preference for the four answers to questions such as:

    If you could influence the educational policies of the public schools of
    some city, would you undertake
    (a) to promote the study and the performance of drama,
    (b) to develop co-operativeness and the spirit of service,
    (c) to provide additional laboratory facilities,
    (d) to promote school savings banks for education in thrift ?

If he puts the answer (c) top, he scores 3 marks for theoretical values,
and if he puts (a) bottom, he scores 0 for aesthetic values. The
other two answers refer to social and economic values. By summing
the answers to all the questions a range of marks of 0 to 60 is possible
on each value.
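
The scoring rule just described can be sketched as follows. The text states only that the top answer earns 3 marks and the bottom answer 0; the intermediate marks of 2 and 1, the assignment of answers (b) and (d) to social and economic values respectively, and the names `score_question`, `total_scores` and `EXAMPLE_KEY` are assumptions made for the purpose of illustration.

```python
VALUES = ("theoretical", "economic", "aesthetic",
          "social", "political", "religious")

# Assumed key for the specimen question: answer letter -> value tapped.
EXAMPLE_KEY = {"a": "aesthetic", "b": "social",
               "c": "theoretical", "d": "economic"}


def score_question(ranking, key):
    """Marks earned on one question.

    ranking lists the answer letters from most to least preferred;
    the first earns 3 marks and the last 0 (2 and 1 in between are
    assumed).
    """
    if sorted(ranking) != sorted(key):
        raise ValueError("every answer must be ranked exactly once")
    marks = {}
    for position, letter in enumerate(ranking):
        marks[key[letter]] = len(ranking) - 1 - position
    return marks


def total_scores(answers):
    """Sum the marks over all questions.

    answers is a list of (ranking, key) pairs, one per question.
    """
    totals = dict.fromkeys(VALUES, 0)
    for ranking, key in answers:
        for value, marks in score_question(ranking, key).items():
            totals[value] += marks
    return totals
```

A testee who ranks the specimen answers in the order (c), (b), (d), (a) would thus be credited with 3 marks for theoretical values and none for aesthetic values on that question.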
31. Watson's test of fairmindedness.—A more elaborate disguise
is adopted in G. B. Watson's ingenious test of fairmindedness or
prejudice (1925). Obviously the name "prejudice" must not be
mentioned, hence the testee's blank is headed "A Survey of Public
Opinion." Nevertheless the degree of prejudice in his political and
ethical opinions is deduced, in six sub-tests. For instance, one of
these contains a series of statements such as:

    All  Most  Many  Few  No    Roman Catholics are superstitious.

The testee who encircles either "All" or "No" is deemed to show
prejudice in his response; any of the less extreme answers are
accepted as a sign of tolerance. In another sub-test appear
statements such as:

    In the United States 3 per cent. of the people own 60 per cent. of
    the wealth.

Following it are several possible conclusions that may be drawn,
including:

    The great incomes should be more heavily taxed.
    Such a concentration of capital is inevitable if industry is to be
    effectively developed.
    No conclusion here stated can fairly be drawn.

The testee who checks the first of these as a legitimate inference is
credited with socialistic prejudice; the second, capitalistic; only
if he checks the last does he obtain a mark for fairmindedness.
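
The marking of the first of these sub-tests amounts to counting the extreme responses, as the following sketch shows. It covers only the rule stated above for the All-to-No items; the function name `prejudice_count` is our own, and nothing here is taken from Watson's actual scoring key.

```python
CHOICES = ("All", "Most", "Many", "Few", "No")
EXTREME = {"All", "No"}


def prejudice_count(responses):
    """Number of items on which the testee encircled an extreme word.

    responses holds one encircled word per statement. "All" and "No"
    are taken as signs of prejudice, the three intermediate words as
    signs of tolerance.
    """
    for answer in responses:
        if answer not in CHOICES:
            raise ValueError("unknown response: %r" % (answer,))
    return sum(1 for answer in responses if answer in EXTREME)
```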


B. THE FORM OF ATTITUDE TEST STATEMENTS

32. Research on attitudes is regarded as so important
that much attention has been paid to the best type of questions or
statements to use (cf. Stagner (1933); Droba (1932)). Kulp (1933)
shows that different tests have employed items which express
    (a) the testee's personal conduct,
    (b) his beliefs or opinions,
    (c) his judgments,
    (d) actual facts.

Kulp finds that even when identical issues are expressed in these
various forms they may evoke somewhat different responses. The
fourth form is clearly unsuitable; [...]
applauded, but they are not genu[...]
writers prefer the first form; [...]
present or future actions seem [...]
directly.


33. Wording of items. Wang (1932b) gives a detailed list of
rules for the wording of items. They should be short, simple and
unambiguous. The following is bad, since it might be taken as
representing opposition to, or support for, birth control:

Birth control legislation is a disgrace to our civilization.

Double-barrelled statements are always ineffective, since some
testees may pay attention to one clause, others to the other (cf. § 46).
Most important of all is their relevance to the issue, and the extent
to which they cover all sides of the issue. Kirkpatrick (1935)
recommends that a careful [...] first be made of the [...] of
the attitude, e.g. to find out what radicalism really impl[ies];
items should then be chosen to represent [...].

Rundquist and Sletto (1936) have de[monstrated] clearly that [...]
any attitude test which asks for socially acceptable [...]
opinions; questions which state the acceptable viewpoint are not
always answered in the same way as questions of apparently identical
content which state the unacceptable viewpoint. For instance,
if 40 per cent. of a certain group answers Yes or True to:

The Government's policy is subservient to big business

it will not usually be found that the same testees, or even the same
proportion of testees, will answer No or False to:

The Government's policy is independent of big business.

Slight differences in meaning cannot account for the different
responses. It appears that there is a general tendency for
unpopular questions or items to arouse [...], and to be
answered less rationally than the same items [...] in the [...]
way (cf. § 158). Since the two types of question [...] different
aspects of the attitude, it is recommended by these authors that
both types should be included in a good attitude test, in equal
numbers.

Another reason for including items which are "pro" and items
which are "con" a particular issue, is that the testees are likely
to read a mixed list more carefully. In an opinionaire which
consisted solely of radically-toned statements, the testee would
be liable to check Yes (or No) throughout, without properly
considering their merits. The two types should, of course, be arranged
in random order. Similarly in a multiple choice test the most "pro"
response should sometimes be printed first, sometimes last. This is
desirable also so as to eliminate "space errors." For Mathews (1927)
has found that there is a considerable tendency in multiple choice
tests to check the left-most, or top-most, responses rather than
responses printed on the right, or at the bottom, of a series.


C. THE STANDARDIZATION OF ATTITUDE TEST ITEMS


34. The single test items quoted above are easy to criticize on
the grounds that they would not necessarily show radicalism,
scientific interests, prejudice, etc. But it should be remembered that
a test consists of a considerable variety of such items, each one of
which is only supposed to be a partial manifestation of the general
attitude. Just as arithmetical capacity is measured not by a single
sum but by a test including many varied arithmetical problems,
so also is attitude measurement approached. And the total score
on the test will be much more valid than the results of separate
items. Moreover, every good attitude test is subjected to a thorough
examination before publication in order to ensure that the items
are consistent, also so as to prove that the test is reliable, and to
provide norms.

As already mentioned, we have no criteria for proving directly
the validity of our tests, though such indirect methods as can be
applied are generally favourable (cf. § 59). But we can and do
try out the extent to which the items hang together consistently;
thus we do not claim merely on the basis of our own a priori notions
that the four statements from a radicalism-conservatism opinionaire,
[quoted] above, are connected with radicalism or its reverse.
The two chief methods for investigating this matter are the
method of internal consistency and the method of external
judgments.


D. THE INTERNAL CONSISTENCY METHOD


35. The experimenter first compiles a much larger number of
items than he will need in the final form of the test, and tries out this
preliminary draft with a group of testees. He assumes that although
many of the items may be poor, yet on the whole the errors among
these are likely to cancel out, so that the total scores possess some
validity. He then analyses each item to see whether or not the
responses to it agree with the test as a whole, and calculates a
statistical index representing its predictive value, i.e. the extent of
this agreement. If the predictive index is too low, the item is
eliminated, or modified and tried out again until it proves
satisfactory. The best items are then used for the final form of
the test*.
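
The procedure of item analysis just outlined may be illustrated by a minimal sketch in Python. It assumes dichotomous items scored 1 (radical) or 0, and takes as the predictive index the ordinary product-moment correlation between each item and the total of the remaining items. This is only one of several possible indices and is not claimed to be the one used by any particular investigator. The function names, the sample data and the cut-off of 0.20 are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Product-moment correlation; 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)


def predictive_indices(responses):
    """responses: one list per testee, each item scored 1 or 0.

    Returns, for every item, its correlation with the total score on the
    other items (the item itself is left out of the total, so that it
    cannot inflate its own index).
    """
    n_items = len(responses[0])
    indices = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        indices.append(pearson(item, rest))
    return indices


def select_items(responses, cutoff=0.20):
    """Positions of the items whose index reaches the (assumed) cut-off."""
    return [j for j, r in enumerate(predictive_indices(responses)) if r >= cutoff]


# Hypothetical preliminary draft: six testees, four items.
draft = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
]
print(select_items(draft))
```

In this invented draft the fourth item runs against the rest of the test and is rejected. With so small a group the indices would of course be quite unreliable; the data serve only to show the arithmetic.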
36. Item analysis techniques. The various types of index, or techniques
for comparing single items with the total test, are described by Lentz and
Hirshstein (1932), Zubin (1934), and Guilford (1936). All of them, it should
[...] a large group of testees; if the group numbers
[...], then the indices of the various items will be so
unreliable as to be practically useless. Sletto (1936), for in[stance ...]

[...] bi-serial r is employed.
[...] Lentz's method (for dichotomous ite[ms ...]
total number of conservative res[ponses ...]
(e) None of the above methods allow[s ...]
here [...] is an alternative which satisfies these dual requirements
fairly effectively.

37. Weakness of internal consistency methods. (1) A grave
weakness in all internal consistency methods is that they merely
serve to select items which agree well with the original total score.
The experimenter's preliminary draft may, for instance, embody
two entirely dissimilar attitudes, which should have been [...]
separately. And yet, as Sletto (1936) has proved, the internal
consistency method will pick out the items that correspond best
[to] a composite of both these attitudes, and will fail to reveal the [...]

[...] [th]e respo[nse to an item ...]
[...] on the content of that item. [...]
influenced to some extent by the testee's
general set towards adjoining [...]

Other errors or biases may have been present in the original form and
[...] will still persist in the final form. Thus, [...]
have wished to assess general radicalism; but his first set of
[...] have been predominantly political, failing to cover other
aspects of the attitude; or they may mostly have been appropriate
to a test of liberalism or tolerance rather than radicalism. The
process of item analysis serves to increase rather than reduce such
biases.

38. Application of factor analysis. Many writers believe that
an attitude test should measure only a single uni-dimensional
variable, and that if the original test contains items expressing
several tendencies, then it should be split up into its components.
[...] [ap]ply the methods of factor [...]
We shall discuss factorial methods below (§§ 61-67),
[...] between the responses to 31 miscellaneous [...]
in arriving at an apparently meaningful set of attitude factors (§ 66); but
it is obvious that these are resu[...] items with which he started, and
that they can hardly lay claim to much significance because his original
selection possessed no logical basis.

Experiments in the factorization of attitudes are certainly of value, since
they will throw much light on those attitudes which are sufficiently simple to
be scaled as uni-dimensional variables, or whi[ch ...]
worth measuring separately. But they are not yet in a position to show us
which items to select and which to reject in our attempts to measure any
particular attitude.
39. Conclusions. In view of these difficulties, Kirkpatrick (1936),
Allport (1935), and others including the present writer, consider that
the insistence on strict uni-dimensionality is undesirable. An
attitude should be accepted as a complex pattern of more or less
[...] tendencies, which may lose its significance if it is too
much subdivided into sim[pler ...]. From this standpoint then
the adequacy of the internal consistency method of standardization
[depen]ds on the adequacy of [the definition of the]
attitude, and the extent to which the original set of items covers
this definition. In other words, statistical methods will assist
greatly [in] the construction of a scale, but cannot produce
[...] unless there is also a sound logical basis behind the
construction.
E. SCORING AND PROVIDING NORMS FOR AN ATTITUDE
[TEST OF THE] INTERNAL CONSISTENCY TYPE

40. Scoring and weighting of items. In a test with dichotomous
(Yes No, or True False) items, the score is generally the simple total
of items responded to in a certain direction, e.g. radically. But in
order to eliminate the effect of omitted items, the total conservative
responses may be subtracted from the total radical responses.
Scores may then range from +n to -n, where n = the number of
items. With triple items (Yes, ?, No), each item may score: +1, 0
and -1; 2, 1, and 0; or 1, 1/2 and 0.
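
The scoring rules of this paragraph are simple enough to be set out as a short Python sketch. The coding of the answers ("R" for a radical response, "C" for a conservative one, None for an omission) and the function names are hypothetical conveniences, not part of any published test. For triple items the weights +1, 0 and -1 are used, and each item carries a key showing whether Yes is the radical answer.

```python
def score_dichotomous(responses):
    """Radical minus conservative responses; omitted items (None) add nothing.

    With n items the result lies between +n and -n.
    """
    radical = sum(1 for r in responses if r == "R")
    conservative = sum(1 for r in responses if r == "C")
    return radical - conservative


# Weights for triple items, taken in the radical direction.
TRIPLE_WEIGHTS = {"Yes": 1, "?": 0, "No": -1}


def score_triple(responses, yes_is_radical):
    """Score Yes / ? / No items as +1, 0, -1 in the radical direction.

    yes_is_radical holds one boolean per item; where it is False the item
    is worded conservatively and its weight is reversed.
    """
    total = 0
    for answer, keyed in zip(responses, yes_is_radical):
        weight = TRIPLE_WEIGHTS.get(answer, 0)  # unknown or omitted answers score 0
        total += weight if keyed else -weight
    return total


print(score_dichotomous(["R", "R", "C", None, "R"]))            # 3 radical - 1 conservative
print(score_triple(["Yes", "No", "?"], [True, False, True]))    # +1, +1 and 0
```

Subtracting rather than merely counting the radical responses means that a testee who omits many items is not thereby made to appear conservative.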

Since some items will always be found, during the analysis, to
have higher predictive indices than others, it would be quite feasible,
not merely to eliminate the poor ones, but also to weight those that
are retained proportionately to their predictive indices. This is
seldom done, except perhaps in tests which contain a rather small
number of items, since such weighting does not appear to improve
the reliability to an appreciable extent. Moreover it greatly increases
the labour of scoring.

41. Acceptability of items. One other point to be examined in a
test of this type is the degree of "acceptability" or
"controversiality" of each item. An item which is answered by over 90
per cent. or less than 10 per cent. of a normal group of testees in the
radical way is of very little use for differentiating people's radicalism.
Usually the acceptability of all dichotomous items should be fairly
close to 50 per cent., i.e. they should be answered in one direction
by about half of an average group. Thus an individual's final score
is based more on the number of situations in which he expresses
a certain sentiment than on the strength of the sentiment implied in
the successive situations presented to him.

42. Scoring multiple choice items. In a test where each item is
provided with five or more grades of response, it is usual to assign
numerals arbitrarily to the various responses, e.g. 1 for the most
conservative, 5 for the most radical response. Likert (1932) tried
out a more rational method, basing the scores on the infrequency
with which a response was given, and turning these [...]
the corresponding normal deviates o[r] sigma scores. For instance
the five responses to one item were checked by 13, 43, 21, 13
and 10 per cent. of testees respectively; the corresponding [sigma]
scores are -1.63, -0.43, +0.43, +0.99 and +1.76. [...] found
however that the simpler method of scoring [yiel]ded almost as
reliable a scale as did this more elaborate [...].

In multiple choice tests the acceptability of [...] response
of the five [should] be as close as possible to 50 per cent. among an
average group of testees. The final score is, of course, based both
on the number of items which the testee answers in a certain direction,
and on the strength of his responses.
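
The conversion of response frequencies into sigma scores can be checked by a short calculation, sketched here in Python. It rests on two assumptions which should be stated plainly: that the five percentages in the example are 13, 43, 21, 13 and 10 (which sum to 100), and that the sigma score of a response is the mean normal deviate of the slice of the normal curve which that response occupies. The function name is hypothetical.

```python
from statistics import NormalDist


def sigma_scores(percentages):
    """Mean normal deviate of each response category.

    percentages: share of testees checking each response, in order from
    one extreme to the other. Every category is assumed to be non-empty.
    For a slice of the unit normal curve between deviates a and b, holding
    a proportion p of cases, the mean deviate is (pdf(a) - pdf(b)) / p.
    """
    unit_normal = NormalDist()
    total = sum(percentages)
    scores = []
    lower_cum = 0.0
    lower_density = 0.0          # the density vanishes at minus infinity
    last = len(percentages) - 1
    for i, pct in enumerate(percentages):
        p = pct / total
        upper_cum = lower_cum + p
        if i == last:
            upper_density = 0.0  # and again at plus infinity
        else:
            upper_density = unit_normal.pdf(unit_normal.inv_cdf(upper_cum))
        scores.append((lower_density - upper_density) / p)
        lower_cum, lower_density = upper_cum, upper_density
    return scores


for pct, score in zip([13, 43, 21, 13, 10], sigma_scores([13, 43, 21, 13, 10])):
    print(pct, round(score, 3))
```

Under these assumptions the computed deviates agree, to within about 0.01, with the values -1.63, -0.43, +0.43, +0.99 and +1.76; small differences of this order are to be expected where the published figures were read from tables.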
43. Norms. Standardization of an attitude test [...] involves
[the] collection of norms. It would be useless to find that testee A gi[ves]
radical responses to 40 out of 60 items [unless we al]so knew the
average and the distribution of scores among [...]
A's social [...]. In obtaining norms the [test]
must be applied to a large, representative [group] of testees, [...]
it is desirable to establish separate norms [...]
may differ widely, e.g. for men and women, for students, for factory
[...]
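
How a raw score acquires meaning from norms may be shown by a small Python sketch. It expresses a testee's score as a standard score and as a percentile rank within a norm group. The function name and the norm scores are invented for the illustration, and the mid-rank convention for tied scores is an assumption rather than a rule laid down in the text.

```python
from statistics import mean, pstdev


def standing(raw_score, norm_scores):
    """Standard score and percentile rank of raw_score within a norm group.

    The percentile rank counts every norm score below raw_score, plus half
    of those exactly equal to it (the mid-rank convention).
    """
    average = mean(norm_scores)
    spread = pstdev(norm_scores)
    z = (raw_score - average) / spread
    below = sum(1 for s in norm_scores if s < raw_score)
    equal = sum(1 for s in norm_scores if s == raw_score)
    percentile = 100 * (below + 0.5 * equal) / len(norm_scores)
    return z, percentile


# Invented norm group: radical responses out of 60 items.
norms = [20, 25, 30, 35, 40]
print(standing(40, norms))
```

A score of 40 out of 60 is thus seen to be high in this invented group, whereas in a group averaging 45 the same score would be below the mean; hence the need for separate norms where groups differ widely.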
F. THE EXTERNAL JUDGMENTS METHOD

44. Many investigators, including Harper (1927), Symonds (1931),
[...] (1930), Lentz (1934), and Kirkpatrick (1935), have called [in]
several judges or selectors in order to discover whether others
besides themselves considered that all the test items were connected
with the attitude they wished to measure. If, for instance, half a
dozen competent people failed to agree that "Conscience is an infallible
guide" really indicates conservatism, then that item would be
eliminated. This seems to be a thoroughly sound principle, since it
ensures that the attitude to be measured is fairly definite and
uniformly identifiable by a group of judges. It might well be
extended so as to provide a solution to the problem which we raised
above, namely the definition of the scope of a complex attitude.
For example the judges might decide what proportions of political,
moral, aesthetic and other items should be included in a test for
general radicalism.

45. Thurstone's scaling technique. An ingenious elaboration of
the external judgments method has been developed by Thurstone
in his series of attitude scales. This enables him to measure attitudes
in terms of equivalent units. We will outline the main steps in the
construction of Thurstone and Chave's scale for Attitude to the
Church (1929). First a large number of heterogeneous opinions
about the church was collected from various sources: editorials,
conversations, and from statements written anonymously by
students, etc., for example:

[...] the church is hopelessly out of date.
[...]st institution in America today.
[...] [ch]urch but do not miss them much when I stay away.
I believe the church has a good influence on the lower and uneducated
classes, but has no value for the upper, educated classes.
I am interested in a church that is beautiful and that emphasizes the
aesthetic side of life.

By ranging far enough afield the investigator can cover all shades
and degrees of opinion, and so escape the narrowness to which
opinions thoug[ht ...] himself are liable. Opinions which
are clearly irrelevant to the main issue should be dropped; others
which are insu[...]cted will be eliminated automatically
in the subsequent standard[ization].
46. Having obtained 130 [...], [Thurstone and]
Chave applied to th[em the] treatmen[t ...]
[...] attitude surveys (§ 8); that is, they found out from
[...] how favourable to the church or how unfavourable
was each [...]. Paired comparisons or ranking could be
[...] [tec]hnique, based on the psycho[...]
intervals, is preferred. (Saffir (1937) show[s ...]
scales).

[...] opinions into eleven piles, regardless of their personal views
about the issue. The pile on the left (numbered 1) includes those
opinions that each judge regards as most appreciative, that on the
right (numbered 11) includes the most depreciatory; the other piles
are intermediate and should be approximately equally spaced out in
their degrees of favourableness. The numbers of statements assigned
to the piles need not be equal; but no one pile should contain as many
as one quarter of the total.

The distribution of positions assigned to each statement by all 
the judges is plotted, and a curve approximating to the normal 
type is usually obtained. The median position is found, i.e. the 
position above and below which 50 per cent. of the judgments of 
favourableness fall; also the quartiles, and the semi-interquartile 
range, Q. The median position is taken to represent the scale value 
of each statement. If Q, the dispersion of judgments, is very large,
this indicates that the judges disagreed about its favourableness.
Probably the statement is double-barrelled or irrelevant, and it should
be dropped out.
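
The calculation of a scale value and of Q from the judges' sortings may be sketched in Python as follows. The sketch assumes that pile k occupies the interval from k - 1 to k on a scale running from 0 to 11, and that the judgments within a pile are spread evenly over that interval, so that the median and quartiles are found by linear interpolation. Both the function names and the sample sortings are hypothetical.

```python
def quantile_from_piles(piles, fraction, n_piles=11):
    """Interpolated position below which `fraction` of the judgments fall.

    piles: the pile number (1..n_piles) given to one statement by each judge.
    Pile k is treated as the interval from k - 1 to k.
    """
    counts = [0] * n_piles
    for pile in piles:
        counts[pile - 1] += 1
    target = fraction * len(piles)
    cumulative = 0
    for k, count in enumerate(counts):       # k = 0 stands for pile 1
        if count and cumulative + count >= target:
            return k + (target - cumulative) / count
        cumulative += count
    return float(n_piles)


def scale_value_and_q(piles):
    """Median position (the scale value) and semi-interquartile range Q."""
    median = quantile_from_piles(piles, 0.50)
    q1 = quantile_from_piles(piles, 0.25)
    q3 = quantile_from_piles(piles, 0.75)
    return median, (q3 - q1) / 2


# Ten hypothetical judges sorting one decidedly unfavourable statement.
sortings = [9, 9, 10, 10, 10, 10, 10, 10, 11, 11]
print(scale_value_and_q(sortings))
```

Here the judges agree closely, so that Q is small and the statement would be retained; a statement whose sortings were scattered over many piles would yield a large Q and be dropped.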

47. The median positions or scale values of the first three
statements quoted above were 9.1, 0.2* and 5.1, showing that they
are very unfavourable, very favourable and intermediate,
respectively. The fourth statement, presumably because of its double
implication, had a high Q, namely 3.6; hence it was omitted from
the final form of the test. The average Q of the statements retained
was 1.67.

Next the statements were submitted [...]
[inter]nal consistency tests. [...]
[th]ey personally agreed. [...]
[stat]ement an "index of [similarity" ...]
the various predictive [indices ...].
The fifth statement quoted here had [...]
was dropped. Finally [...]
selected which passed both these [tests ...]
out so as to cover the whole scale from 0 to 11.

* A scale value will be less th[an ...]
assign the statement [...]
the second or hi[gher ...] per cent. of the judges [...]
more than 50 per cent. in the eleventh pile.

48. [...]en the method of scoring employed in
[internal] consistency tests and in Thurstone's tests should be noted.
In the former the items are of roughly equal "acceptability"
and the score is derived from the [...] of the [...].
Here the items are strongly graded from low to high acceptability
and the score is based on "level" or "intensity". We might
compare the two types to Thorndike's "width" and "altitude"
approaches in the measurement of intelligence (1926).

As in internal consistency tests, so in these scales the favourable
and unfavourable statements are mixed up on the test blank. And
the testee is not informed as to their scale values until he has
completed his endorsements.

49. Applications of Thurstone's technique. This scaling technique
has been very widely applied. Scales are available for measuring
attitudes towards war, pacifism, communism, birth control, the
movies, prohibition, evolution, negroes, and a number of other
controversial issues. Uhrbrock (1934) has standardized a scale of
fifty statements for testing favourableness among factory employees
towards the management; the following are some examples, together
with their scale values:

(2.1) You've got to have a "pull" with certain people around here to
get ahead.
(10.4) I think this company treats its employees better than any other
company does.
(5.4) I believe accidents will happen no matter what you do about them.

It should be noted that these scales all deal with "pro" and
"con" attitudes; but it should be quite possible to treat other
variables similarly, provided that they can be sufficiently clearly
defined for easy sorting, and that they can reasonably be regarded
in terms of "more or less," i.e. as uni-dimensional. This might
apply to radicalism-conservatism. Similarly Eaglesham's (1937)
list of ideals in education might be sorted according to the extent to
which they represent the progressive or the traditionalist school
of educational thought. This done, testees (e.g. teachers) might
check those ideals with which they agree, and so be scored for their
progressiveness. There are, however, a number of more general
attitudes which can be dealt with fairly effectively by internal
consistency, but which would not be amenable to scaling. For
instance it would hardly be possible to construct a graded series of
statements with which to measure tolerance, or scientific-mindedness,
or aesthetic values.


G. CRITICISMS OF THURSTONE'S TECHNIQUE


50. Reliability of sorting. The sorting technique has been
criticized by Rice (1930) on the grounds that judges with different
personal opinions might give different judgments as to favourableness.
However, experiments by Hinckley (1932), Ferguson (1935) and others
prove that bias among the judges does not affect the score values.
They would be almost identical whether the statements were sorted
by ardent church-goers or by atheists. Ferguson's study also
indicates that reliable score values may be derived from as few as
twenty-five judges. Nevertheless, the technique is somewhat
laborious, and it has to be supplemented by the analysis of indices of
similarity. Thurstone (1929) has described a method based wholly
on the actual endorsements of testees, which eliminates the sorting.
He calls this the "method of similar reactions." It does not appear,
however, to have been widely applied.
51. Rationale of scaling. Next, there is the objection which we
raised above (§ 39) to the narrowness of scales which insist on
testing simple, uni-dimensional, attitude variables. Thurstone
admits that attitudes are too complex to be fully described by a
single figure on a linear scale. But he argues that it is quite
legitimate to single out one aspect, such as
favourableness-unfavourableness, from the total pattern and measure
it scientifically. For we do the same when we measure, say, the height
of a table with a foot-rule; and we do not blame the foot-rule because
it fails to give us a complete description of all the characteristics
of the table. Moreover, he would claim that such simplification is
justified by the resulting accuracy of his scales, and the equivalency
of his units of measurement.


[...] Kirkpatrick [...]
units. They ar[e ...]
because they lack [...]
not (like a length of so man[y ...]
[...] writers (Allport (1935); Likert
(1932), etc.) have doubted whether such refined techniques as
[...] [c]onceptions as human [...]

[...] found by Likert (1934) to be impr[...]
checking, marked [e]very item 1 to 5 [...]
average testee endorsed opinions whi[ch ...]
apart, in fact covering two thirds of the total scale. The [...]
imply very poor reliability. Probably, however, this issue does not
rouse sufficiently definite and diverse opinions to be suitable [for]
scaling.

In spite of their different derivations, it would seem that the two
types of scale yield much the same [...]
correlation of +0.78 between the Thurstone-Chave
Church scale and the Allport-Vernon measure of Religious Values, a
figure which approximates to the reliabilities of the tests.

53. The conclusion which appears to follow from the above
discussion is that for most purposes of attitude measurement, the
internal consistency type of test is the simpler and more satisfactory,
provided that it is based on a thorough analysis of the attitude in
question, and that the external judgments of a number of competent
persons besides the experimenter himself are employed in the
preliminary planning and selection of items; but that for accurate
research, e.g. on group differences, or on the modification of attitudes
by propaganda or other influences, a scale standardized by
Thurstone's technique is preferable.


H. RELIABILITY AND VALIDITY OF INDIVIDUAL ATTITUDE
TESTS 

54. In establishing repeat reliability, the scores of a group of
testees who have answered the test twice are inter-correlated. Since,
however, memories of the first application may affect the answers on
the second occasion, it is better to compare two parallel forms of the
test. More frequently the consistency is determined by correlating
the scores on one half of the test (e.g. the odd-numbered items) with
scores on the other half (the even-numbered items), and then applying
the Spearman-Brown correction. If this method is adopted with
internal consistency tests, the scores of the group upon whom the
item analysis was performed must on no account be used, since they
will yield a spuriously high reliability coefficient; the test must be
given to a fresh group.
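
The split-half procedure with the Spearman-Brown correction may be illustrated by a brief Python sketch. The correction for a test of double length is r' = 2r / (1 + r), where r is the correlation between the two halves. The function names and the item scores are hypothetical, and the data are assumed to come from a fresh group, not from the group used for item analysis.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Product-moment correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def spearman_brown(r_half):
    """Estimated reliability of the whole test from the half-test correlation."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(responses):
    """responses: one list of item scores per testee.

    Odd-numbered items (positions 0, 2, 4, ...) form one half and
    even-numbered items the other; the correlation between the two half
    scores is then stepped up by the Spearman-Brown formula.
    """
    odd_half = [sum(row[0::2]) for row in responses]
    even_half = [sum(row[1::2]) for row in responses]
    return spearman_brown(pearson(odd_half, even_half))


print(spearman_brown(0.60))   # a half-test correlation of 0.60 is stepped up to about 0.75
```

Thus a correlation of 0.60 between the halves corresponds to an estimated reliability of about 0.75 for the full test.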

55. Conditions affecting reliability. The reliability of most
attitude tests is quite high, coefficients of 0.75 to 0.90 or even more
being commonly obtained (cf. Likert (1932); Lentz (1930, 1934);
Thurstone and Chave (1929); Kirkpatrick (1935); etc.). The level
of reliability is directly dependent on the heterogeneity or diversity
of opinions among the individuals tested, not (as in group surveys)
on their homogeneity. In studying the repeat reliability of his
conservatism opinionaire, Lentz found that the average testee may
change his responses to 15-20 per cent. of items; but that most of
these neutralize one another, so that the total scores are little affected.
Over long periods some attitudes seem liable to considerable changes:
Farnsworth (1937), using Peterson's Attitude to War scales, obtained
coefficients of 0.88 when Form B was taken a few days after Form A,
but 0.30, 0.27 and 0.12 when taken 1, [...] 3 [...] later. By
[a tech]nique which Thouless [...] has described it is [...]
radicalism rather than to the unreliability of the test as a measuring
instrument.
56. Relation between reliability and validity. From experience
gained in standardizing the Study of Values, the present writer has
been led to the conclusion that it is a mistake to aim at too high a
reliability in an attitude test, since it may be obtained at the expense
of validity; a similar argument is put forward by Kirkpatrick
(1936). Very high reliability is usually found when a test consists of
homogeneous material which is repeated many times under identical
conditions (e.g. simple reaction times). Probably therefore the more
closely similar in content the items of an opinionaire, the higher its
reliability. But if the items are very homogeneous and very numerous,
the testee can hardly fail to realize that they all refer to his own
radicalism, or other attitude, and he may therefore tend to answer
each item more according to his personal opinion of his radicalism
than in accordance with the way he really thinks or feels about that
item. Now a valid test of radicalism is surely not one that, in effect,
asks the testee fifty times over whether he regards himself as a radical,
but one that presents fifty different situations in which various aspects
of radicalism may be expressed. More experimental investigation
is needed before this thesis can be established; and it probably
applies less to some attitudes than to others. For instance, more
direct questions which evoke the testee's own opinion of his attitude
may be appropriate in testing attitude to the [...]
fatal in the testin[g ...]
[...] questions must be heterogeneous
and disguised as far as possible; and therefore the reliability should
[...]

[...] it privately, tends to make the opinions expressed somewhat more
conservative.

58. Defects due to the form of the test. The difficulties occasioned
by the form of the test are much intensified; for instead of ranking,
rating or pairing his preferences, the testee has to express a large
number of complex responses to complex questions in terms of
Yes, No, or numbers, or a few restricted answers. To expect such
answers to cover completely every individual's spontaneous reactions
to the questions is obviously futile. However carefully constructed
the questions and the provided responses, intelligent testees will
wish to make innumerable qualifications, though, as mentioned
above (§ 94), the less sophisticated will be more amenable to this
strait-jacketing. Yet the critical testee is not justified in concluding
that the test is worthless because he cannot always give his
natural response to each item. For it must be remembered that the
score for an individual attitude (unlike the results of most group
surveys) is based on a large number of items, and that the
misrepresentations which he finds in particular items are on the whole
likely to cancel one another out. Admittedly these difficulties may
produce errors which tend to decrease the validity of certain items,
but the errors are variable rather than constant, and so may have
little effect on the final scores. This weakness is, of course, the
penalty that must be paid if quantitative results are to be obtained;
it would be quite impossible to measure attitudes or to compare
individuals objectively if each individual expressed his attitudes
in his own way. Nevertheless, the more care that is given to the
compilation of items and to the preliminary trying out and discussion,
so as to cover various shades of opinion as completely as possible,
the better will be the test. And it would seem well worth while
leaving space for spontaneous comments at the bottom of the test
blank, or discussing the items with the testee after he has filled in
his answers, in order to obtain more insight into his attitude to the
test and into the significance of his score. That the forcing of
responses into simple categories does not greatly distort the attitudes
was well demonstrated by Stouffer's experiment (1931). He obtained
from 238 students anonymous accounts of their own experiences
and opinions about alcohol and prohibition. These completely
unforced expressions of attitude were rated by four independent
judges as to their favourableness or unfavourableness to prohibition.
The ratings were then found to correlate +0.81 with the scores
of the same students on a test of Attitude to Prohibition, the (split
half) reliabilities of the ratings and the test being respectively 0.96
and 0.94.

59. Empirical evidence of the validity of attitude tests. We may
claim then that when reasonable precautions are taken in constructing
the test, and in obtaining co-operation from the testees, the
validity should be good. And there is a large amount of scattered
evidence supporting this conclusion. Moderate correlations are
obtained between testees' scores on the Study of Values and
[est]imates by their friends of these testees' aesthetic, economic and
[...] (cf. Vernon and Allport (1931)). Thurstone quotes
similar results with his scales. We would not, of course, expect
to get perfect coincidence between the attitude which the individuals
express in the test and the attitudes with which their friends [credit]
them. Again Cantril (1932, 1933) has shown that the results of
the Values test possess a wide predictive value; e.g. they correlate
with speed of free word association to aesthetic, economic, etc.
word lists, with the topics in newspapers that people with different
interests notice when they scan the headlines, and so on.

Both this test and a great number of attitude scales have been
found to differentiate effectively between groups of persons who
might be expected to possess contrasting attitudes. Women
consistently come out higher than men in aesthetic, social and
religious values, lower in scientific, economic and power-seeking.
Business, theological, science and art students, etc., obtain appropriate
results. On Thurstone's Church scale the average score of Roman
Catholic students was 2.90, of Jewish students 5.44. Uhrbrock
(1934) finds significant differences between the attitudes to employers
of foremen, clerical staff and operatives. On Watson's test (1925),
Social psychology students in an eastern American University are
much more fairminded than Middle-West parsons. Vetter (1930),
G. Allport (1929) and others note meaningful relations between
radical attitudes and income of the parents, Jewish race, etc.;
Klein (1925) connects up the attitude with antagonism of the testees
to their own fathers. Likert (1932) shows that University students
are much more opposed to negroes [...]
[...]. G. Allport (1929), Watson (1929)
and others find that increased knowledge about an issue correlates,
[...] [m]ore progressive, liberal or tolerant [...]
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inter-relationships were revealed in Katz and Allport’s (1931) 
study (by means of a questionnaire) of the opinions on various topics 
of some four thousand students. For example, fraternity members 
differed from other students not only in attitudes to college affairs, 
but also in political, racial and religious opinions. The following 
instance is taken from an investigation by Zimmermann (1934). 
Testees were classified into two groups, one of which regarded God
as “a personal being” or as “a power making for righteousness,”
and the other regarded God as “a projection of our social conscience”
or as “the universe as a whole.” In other questions it
was found that 71 and 42 per cent. respectively of these groups were 
in favour of prohibition, 13 and 26 per cent. in favour of socialism, 
42 and 65 per cent. for birth control, and so on. Apparently some 
general tendency or tendencies are running through the various 
specific responses. These particular results were derived from 
answers to questionnaires ; but much the same is found if a series 
of standardized attitude scales are inter-correlated. Significantly 
large coefficients are almost always obtained. For instance, in
Carlson’s (1934) investigation, all the correlations between communist, 
pacifist, atheist, and anti-prohibitionist attitudes were positive. 

61. Object of factor analysis.—It would then be a considerable 
advance if, instead of recognizing perhaps a thousand or more
attitudes, interests, sentiments and ideals among human beings, and 
attempting the impossible task of measuring each of them in turn, 
we could generalize and abstract a few distinctive and fundamental 
tendencies in terms of which all the attitudes could be classified. 
There might conceivably be quite a small number of basic elements, 
somewhat analogous to the chemical elements ; and the attitudes 
with which we are dealing at the moment might be analysed, like 
chemical compounds, into combinations of these elements. Now 
there have developed recently certain statistical methods which 
claim to be able to accomplish precisely this analysis of a set of 
psychological variables into their underlying components. These 
originated in the work of Spearman (1927) on the fundamental 
factors behind intelligence and other abilities (work which has been 
described in Report No. 53 by Earle and Milner (1929)). They have
since been greatly elaborated by Kelley, Thurstone, Hotelling,
Holzinger, and other statisticians in America, and by Burt, Thomson 
and Stephenson in this country. It is impossible to describe here 
these highly technical methods, but we will endeavour to outline 
their aims and achievements and, later, their limitations. 

62. Spearman's technique.—All start out from a table of the
correlations* between a series of tests or psychological measurements,
which have been applied to the same group of testees. Now a
correlation between a pair of tests implies that there is something
common to them. Spearman found, when working with correlations
between tests of abilities, that all the correlations could be accounted
for by the assumption of a single common factor which he called g, the
general intellectual factor. The correlations were, of course,
less than unity, and this was because each test was considered to
depend on a specific factor, s, found in that test alone, in addition
to its dependence on, or “saturation” with, g. What is known
as the “tetrad difference” technique was developed for dealing
with data of this type; it is easy to use, though very laborious.
We cannot however expect psychological tests always to yield so
simple a picture; there may often be more than one common factor
running through them. Spearman's technique is less well adapted
for dealing with such data. Holzinger's (1937) extension of it,
which he calls the “bi-factor method,” is claimed to be capable
of isolating factors additional to g; but full details of this are not
yet available*. Kelley's (1928) elaborate method depending on
the “pentad criterion” had the same aim, but it has now been
discarded in favour of the newer multiple factor analysis techniques.
Of these Thurstone’s (1933, 1935) simplified “centre of gravity”
method is perhaps the most lucid and systematic, and is the most
widely used at the present time.

* Burt (1937a) shows that it would be preferable to factorize co-variances
rather than correlations, and by so doing he is able to simplify Hotelling’s
method.
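
The “tetrad difference” check mentioned in this section can be sketched briefly. The idea, as it is usually stated, is that if a single common factor accounts for all the correlations, then for any four tests a, b, c, d the products r(ab)·r(cd), r(ac)·r(bd) and r(ad)·r(bc) are equal, so that their differences vanish. The saturations below are hypothetical figures chosen only for illustration, and the function name is our own.

```python
from itertools import combinations

def tetrad_differences(r):
    """Return the three tetrad differences for every set of four tests.

    r is a square correlation matrix given as a list of lists.
    """
    differences = []
    for a, b, c, d in combinations(range(len(r)), 4):
        differences.append(r[a][b] * r[c][d] - r[a][c] * r[b][d])
        differences.append(r[a][b] * r[c][d] - r[a][d] * r[b][c])
        differences.append(r[a][c] * r[b][d] - r[a][d] * r[b][c])
    return differences

# Hypothetical saturations of five tests with a single common factor.
saturations = [0.9, 0.8, 0.7, 0.6, 0.5]

# With one common factor only, each correlation is the product of the
# two saturations concerned.
r = [[1.0 if i == j else saturations[i] * saturations[j] for j in range(5)]
     for i in range(5)]

largest = max(abs(x) for x in tetrad_differences(r))
print(largest)  # vanishes, apart from rounding error
```

With real data the differences never vanish exactly, and the labour lies in deciding whether they are small enough relative to their sampling errors; that step is not attempted here.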
graph was needed, hence the points were plotted on the surface of a sphere).
Thurstone has shown that more expressive factors may be obtained by
choosing a fresh set of axes, passing through the same origin as the initial 
ones, but rotated through a small angle. The choice of the most suitable 
axes is guided by logical consideration of the inter-locking tests ; and its aim 
is to produce as many high positive and as many zero loadings as possible 
in each factor, in place of many moderately high positive or negative loadings. 
When the axes have been decided, the new co-ordinates of the points on the 
graph give us the new factor loadings of the tests. 

The three new dimensions, so obtained, afforded a more logical classification 
of student teachers’ abilities than did the initial ones. They could readily 
be identified as: (1) General teaching ability (2) Scientific-logical ability 
(3) Literary-humanistic ability. The full results will be published elsewhere, 
but one or two examples of the “ saturation co-efficients ” of the original 
subjects will illustrate their usefulness. The loadings of speech training with
the three factors were +.66, .00 and +.24, respectively; i.e. this subject is
very important in teaching, depends also on literary, but not on scientific
ability. Arithmetic similarly gave +.20, +.56 and .00; the intelligence
test .00, +.30 and +.28. The latter, as has often been found, does not
affect teaching ability, but is moderately related both to scientific and to
literary abilities.
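
The loadings quoted above can be put to simple use. As a minimal sketch, and assuming (which the text does not state) that the three rotated factors are uncorrelated, the share of a subject's variance accounted for by the common factors is the sum of its squared loadings, and the correlation between two subjects implied by the factors is the sum of the products of their loadings. The function names are our own.

```python
# Loadings on (1) general teaching, (2) scientific-logical and
# (3) literary-humanistic ability, as quoted in the text.
loadings = {
    "speech training": (0.66, 0.00, 0.24),
    "arithmetic": (0.20, 0.56, 0.00),
    "intelligence test": (0.00, 0.30, 0.28),
}

def communality(subject):
    """Variance of a subject accounted for by the three common factors."""
    return sum(x * x for x in loadings[subject])

def implied_correlation(first, second):
    """Correlation implied by the factors, assuming they are uncorrelated."""
    return sum(x * y for x, y in zip(loadings[first], loadings[second]))

for name in loadings:
    print(name, round(communality(name), 4))

print(round(implied_correlation("speech training", "arithmetic"), 4))
print(round(implied_correlation("arithmetic", "intelligence test"), 4))
```

On this assumption the common factors account for about half the variance of speech training but well under a fifth of that of the intelligence test, the remainder being specific to each measure or due to error.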

64. Hotelling’s, Kelley’s and Burt’s techniques.— These techniques 
are more mathematically perfect than Thurstone’s, since they account 
completely for all the test inter-correlations. But they seem at 
present to be less useful to the applied psychologist, in that they 
are more complicated and that it is more difficult to interpret the 
psychological significance of their factors. One advantage of 
Hotelling's (1933) method over Thurstone's is that a testee's scores
on each of the extracted factors can be calculated more readily 
(cf. Flanagan, 1935). Kelley (1935) asserts that the bases of his 
or Hotelling's methods are irreconcilably different from Thurstone's. 
But Burt (1937a) shows that they yield much the same results, at 
least in respect of the major factors, only by different routes ; indeed 
the products of a Thurstone analysis may be regarded as first approximations
to the ideal solutions provided by Hotelling's “iterative”
or Kelley’s “trigonometrical” methods. Burt's own method of
“higher moments” (cf. Hartog, Rhodes and Burt, 1936) gives these
same ideal solutions, but with a considerable reduction in mathe- 
matical labour. 
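
An iterative extraction of the first factor can be sketched as follows. This is offered only as an illustration of the general idea, namely repeated multiplication of a trial set of weights by the correlation matrix until the weights settle down; it is not a reproduction of Hotelling's published procedure, and the correlation matrix is hypothetical.

```python
import math

def first_component(r, iterations=200):
    """Approximate the largest latent root of r and the loadings on it.

    r is a symmetric correlation matrix (list of lists) with unities
    in the diagonal.
    """
    n = len(r)
    weights = [1.0] * n  # trial weights to start the iteration
    for _ in range(iterations):
        product = [sum(r[i][j] * weights[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(x * x for x in product))
        weights = [x / length for x in product]
    root = sum(weights[i] * sum(r[i][j] * weights[j] for j in range(n))
               for i in range(n))
    factor_loadings = [math.sqrt(root) * w for w in weights]
    return root, factor_loadings

# Hypothetical correlations between three tests.
r = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]

root, factor_loadings = first_component(r)
print(round(root, 3), [round(x, 3) for x in factor_loadings])
```

Later factors would be obtained by removing the contribution of the first from the matrix and repeating the process, which is where most of the labour arises.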

65. Applications of factorization : the general radicalism factor.— 
So far few factorial studies have been carried out with attitude tests. 
Kulp and Davidson (1934) applied a Spearman analysis to brief 
opinionaires dealing with racial, imperialist, international and other 
attitudes, and found good evidence of a general factor running through 
all of them, which they identified with liberalism. Thurstone (1934) 
applied his method to the correlations between eleven of his scales 
and obtained a conspicuous radical-conservative common factor, and 
a second smaller factor which seemed to represent, chiefly, a 
nationalistic tendency. Carlson (1934), Lentz (1934) and others 
have also noted fairly high correlations between different tests
suggesting a common radical or progressivist tendency. The
unanimity of these studies is rather striking in view of the many 
doubts that have been expressed as to the existence of a general 
radical-conservative tendency (cf. § 74). We might for instance
suppose that people could be left wing in politics, conservative 
about morals, and undecided in other spheres. Some are no doubt 
relatively specific in their inclinations, but apparently a fairly 
consistent attitude in all spheres is more common*. The factor also 
correlates positively with educational attainment and, to a small 
extent, with intelligence (thus Thurstone (1934) finds, in a student
group, a correlation of -0.44 between an intelligence test and a
scale for patriotism). So far this is the only dimension of attitudes
and interests which can definitely claim to be established. 

66. Factorial studies by Whisler and Lurie.—We have already 
mentioned (§ 88) Whisler’s (1934) analysis of 31 general attitude 
questions. Six factors were extracted which he identified as 
follows :— 


  I. Acceptance of conventional ethical standards
 II. Enjoyment of fleeting pleasures (or youth v. age)
III. Interest in conflicts and controversies
 IV. Interest in controlling people and manipulating things (or sense of
     power)
  V. Interest in social participation
 VI. Sophisticated, critical attitude.

As shown above, the significance of these is dubious, since they 
depend so much on the particular test questions which the author 
happened to select. More useful results seem to emerge from an 
analysis by Lurie (1937) of the six types of general interest which the 
Study of Values (§ 80) attempts to measure. Here the selection of 
test questions was based on a logical scheme of classification which 
had been very thoroughly worked out by the German philosopher 
E. Spranger (1928). Four sets of six questions referring to each
value, 24 sets in all, were factorized; after rotation of axes, factors
corresponding rather closely to Spranger's types were obtained,
namely:—


  I. Social and altruistic attitude (composed largely, though not wholly,
     of items designed for testing the social-humanitarian type)
 II. Philistine attitude (this combined items from three types, economic,
     power-seeking and, negatively, aesthetic items)
III. Theoretical and scientific attitude
IV. Religious and spiritual (as o; 
statistical investigation is a valuable corollary to the logical analysis,
and that it helps to show which attitudes are, or are not, sufficiently 
distinctive to merit separate measurement. 

67. Difficulties in the identification of factors.—We will return to 
a critique of factor analysis in Chapter VII, when we have seen 
the results that it yields with tests other than attitude scales. But 
we must note here one of its weaknesses, namely the difficulty in
interpreting the psychological meaning of the factors. Almost the 
only way we have of identifying them is to examine which tests or 
test items are most highly saturated with them (this was done, 
above, in factorizing the student teachers’ marks). Naturally we 
must not expect a factor to correspond precisely to any one con- 
ception or trait with which we are familiar in everyday descriptions 
of human beings. It is likely to cut across such conceptions, since 
it is in essence an abstraction from tests of a number of different 
traits. For instance, Thurstone was doubtful whether the common 
factor in his attitude scales (§ 65) represented primarily a radical
or an anti-religious tendency ; possibly it should be interpreted as 
a blend of the two. Not infrequently a factor appears to involve 
an assemblage of traits so meaningless that one finds great difficulty 
in believing that it really corresponds to any basic personality
tendency. The following is a rather extreme example, namely,
Whisler’s fifth factor which he considered to involve mainly social
participation; actually it is composed of the following cluster of
test items:—

Rarely thinks about the meaning of life
Does not enjoy spicy and highly seasoned foods
Much interested at a play in whether the characters violate conventional
   codes of behaviour
Prefers working with people to working with things or materials
Considerably interested in politics
Daydreaming has increased during the past five years
Differs considerably from intimate friends in interests, etc.


Copeland (1935) has criticized the laxness of the psychological 
side of factor analysis, contrasting it with the meticulousness of the 
mathematical side. He suggests that decisions as to the meaning 
of the factors should be based upon group judgments instead of 
merely upon the investigator's personal interpretation. 


(ii) Factor Analysis Applied to Correlations between Persons 

68. Burt (1937b) has recently pointed out an alternative approach
to correlational and factor studies, based on the resemblance between
persons instead of between tests. Supposing that fifty persons each
take fifty tests, it would be just as easy to calculate the correlations
between the scores of each pair of persons on all the tests as to perform
the more usual calculation of correlations between all the testees'
scores on each pair of tests. And the resulting 1225 correlations
might equally well be submitted to Thurstone’s or some other
multiple factorization technique. Stephenson (1936 a, b, c, d) calls
this the “Q-technique,” or “inverted factor analysis” (the latter
title is, however, somewhat misleading). He has carried out by
means of it a number of interesting studies of personality. 
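
The difference between the two kinds of correlation table can be shown with a small sketch. The scores below are hypothetical, and it is assumed for simplicity that they are already expressed on a comparable scale for every test (in practice they would first be standardized test by test). The same table of scores is correlated first by columns (tests) and then by rows (persons); with fifty persons the second procedure gives 50 × 49 / 2 = 1225 coefficients.

```python
import math

def pearson(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def pair_count(n):
    """Number of distinct pairs among n persons (or n tests)."""
    return n * (n - 1) // 2

def correlations_between_tests(scores):
    """Usual procedure: correlate each pair of columns (tests)."""
    columns = list(zip(*scores))
    return {(i, j): pearson(columns[i], columns[j])
            for i in range(len(columns)) for j in range(i + 1, len(columns))}

def correlations_between_persons(scores):
    """Inverted procedure: correlate each pair of rows (persons)."""
    return {(i, j): pearson(scores[i], scores[j])
            for i in range(len(scores)) for j in range(i + 1, len(scores))}

# Hypothetical scores: one row for each person, one column for each test.
scores = [
    [12, 15, 9, 20],
    [11, 16, 8, 19],
    [20, 9, 15, 12],
]

print(pair_count(50))
for pair, r in correlations_between_persons(scores).items():
    print(pair, round(r, 2))
```

Here the first two persons have much the same relative scores and correlate highly, while the third shows a different pattern and correlates negatively with both.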


It is somewhat difficult at first to grasp the implications of this approach. 
But consider a pair of testees who are found to inter-correlate very highly ; 
they resemble one another closely because they obtain much the same relative
scores on the fifty tests. And if a common factor is found which will account
for a number of the inter-person correlations, this will mean that all these
persons approximate to a common type, since they all have similar relative
scores. Again, just as in a straightforward factorization of tests, some tests
are found to be highly saturated, some poorly saturated with a factor; so in
this “inverted” factorization, some persons will be found to approximate
to the type more closely than others. If the analysis is continued a second
type (i.e. factor) may emerge to which a n
persons whose test scores are all simil; 
of the first type. 
likely to fit smaller numbers of persons ; just as in ord 
the third, fourth, and later fac 


to the straightforward factorization of tests. 
an individual on a set of test-factors should: 
Department of Psychology, University College, on attitudes to
pictures, also indicate the existence of distinctive types of
appreciation. Some of this work will be presented in a forthcoming
article by Dewar (1938). 


(iii) Conformity—Atypicality of Opinions 

71. A large variety of tests have been based on the extent to 
which a testee’s responses approximate to or deviate from some 
predetermined standard responses. Thus Deutsch (1923) measured 
conventionality or conformity by giving a list of questions with 
multiple choice answers, one of which was the conventional answer 
(e.g., one question dealt with methods of disposal of the dead). The 
proportion of conventional answers checked by the testees was found 
to correlate well with assessments by acquaintances of the testees’ 
conventionality. Artistic or musical taste is almost always measured 
by presenting reproductions of contrasted works of art (photographs 
of furniture, selections from poems, gramophone records of music,
etc.); these works have previously been judged by art experts, and
the number of times a testee's preferences agree with the experts'
preferences is taken as an index of his taste. The validity of these
tests is somewhat dubious for reasons which the writer has discussed
elsewhere (1935). Barry (1931) attempted to test “compliance”
or “suggestibility”; he first gave an opinionaire dealing with a
variety of topics to a group of students, and later repeated it, telling
the students, before they marked each item, what was the previous
commonest response in the group. He then summed the numbers
of items to which they altered their first responses in the direction
of conformity with the group response. Individual differences in
susceptibility to propaganda have been investigated similarly.
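
The two kinds of score described in this section can be sketched as follows. The first is the proportion of a testee's answers that coincide with the predetermined conventional answers; the second is a count of items on which the second response lies nearer to the group's commonest response than the first did. The data, the function names and the assumption that responses are marked on a numbered scale are ours, and are not taken from Deutsch's or Barry's own procedures.

```python
def conventional_proportion(chosen, conventional):
    """Proportion of items on which the testee chose the conventional answer."""
    matches = sum(1 for c, k in zip(chosen, conventional) if c == k)
    return matches / len(conventional)

def conformity_shift(first, second, commonest):
    """Number of items altered in the direction of the group's commonest response.

    Responses are assumed to be positions on a numbered scale, so that
    'direction' can be judged by distance from the commonest response.
    """
    count = 0
    for before, after, group in zip(first, second, commonest):
        if after != before and abs(after - group) < abs(before - group):
            count += 1
    return count

# Hypothetical multiple choice answers and the conventional key.
chosen = ["b", "a", "c", "a"]
key = ["b", "a", "a", "a"]
print(conventional_proportion(chosen, key))

# Hypothetical opinionaire responses on a five-point scale.
first = [1, 4, 3, 5, 2]
second = [2, 4, 3, 3, 1]
commonest = [3, 4, 3, 3, 3]
print(conformity_shift(first, second, commonest))
```

In the second example two items move towards the group response, two are unchanged, and one moves away from it and is therefore not counted.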

72. Ethical discrimination tests.—Innumerable tests of moral
judgment or ethical discrimination have been devized, the earliest 
being Fernald's (1912). A good description of them may be found 
in Symonds (1931, pp. 268-285). Here the testees register their 
opinions on moral issues and are scored according to the conformity 
of their responses to some arbitrary ethical standard. Any of the 
techniques already described may be adopted:—voting on the
blameworthiness of various crimes; ranking the ten commandments
in order of importance; true-false tests, e.g.—

Good marks are chiefly a matter of luck          True   False
Clean speech is a sign of being a goody-goody    True   False

multiple choice items, e.g.—

If someone steals your lunch, you should:
   Steal another lunch to even it up
   Report it to the teacher
   Cry about it
   Say nothing about it.

Similarly, moral stories, or pictures of good and bad deeds, may be
presented together with several alternative outcomes, the best of
which is to be chosen.

73. Reliability and validity of ethical discrimination tests.—The
reliability of these tests is generally high, so long as they contain a 
large enough number of items, but there is grave doubt as to what 
they measure, i.e. their validity. Hartshorne and May (1928), who 
devized and tried out a very ingenious series of them, found that 
different moral judgment tests gave positive though rather low 
correlations with one another; but that there was scarcely any
agreement with tests of honest or dishonest behaviour. Others have
demonstrated that the moral judgments of delinquent children tend 
to be just as "correct" as those of non-delinquents. Similarly, 
when 200 prisoners and 272 school teachers arranged 45 criminal 
acts in order of seriousness, the two lists were practically identical, 
showing that deviations from the teachers’ list cannot be taken as
an index of criminality (cf. Simpson, 1934). Many investigators 
find a significant correlation between moral judgment scores and 
intelligence tests, which suggests that the responses of delinquents
are determined more by their comprehension of society’s moral
conventions than by their own moral habits of behaviour. But 
why should there be so much less relationship between attitudes 
and behaviour in this instance than in most of the tests we have 
described in earlier sections? We would suggest, though we cannot 
yet prove, that the difference is mainly due to the different 
Einstellung among the testees. They realize, or at least the more 
intelligent ones realize, what the tester is trying to measure, and 


assume that it may be to their advantage to give conventionally
moral responses. Thus the results of these tests supply a valuable
[...] deducing conduct from verbal [...] importance of the attitude
with which the testees approach the test situation.

(iv) Extreme versus Moderate Opinions 


74. Psychological conception of extremeness v. moderateness.— 
F. Allport and Hartman (1925), Vetter (1930) and (193
(cf. § 17) studied individual differences in certainty (i.e. extremeness)
of religious and other beliefs, hoping thereby to obtain a measure of
irrational thinking.

75. Experimental justification.—Unfortunately, experimental
justification for this conception is as poor as the justification for the 
radical-conservative continuum is good. Testees do vary widely 
in extremeness on many tests, but fail to do so consistently. The 
present writer has often found scores among students ranging from
10 or 20 per cent. to 80 or 90 per cent., where 0 per cent. represents 
the greatest possible moderateness, 100 per cent. greatest ex- 
tremeness. Yet the correlations between these scores, derived from 
different attitude tests, were generally negligible. Moreover, there is
little proof that these variations correspond to any distinctive 
psychological trait, with the exception of sex—men usually giving 
more extreme opinions than women. Watson’s measures do agree
fairly closely with his other tests of prejudice ; and in one study the 
writer found low positive correlations with tests of impulsiveness- 
caution. Allport and Hartman (1925) claimed to have found 
personality differences between extremes and moderates, but their 
results were probably not very reliable statistically. In another 
experiment the writer compared four different measures of extreme- 
ness with his testees’ results on the Boyd Personality Questionnaire 
(§ 140), but failed to find any consistent personality correlates. We
must conclude then that though this conception is psychologically
promising, much more rigorous experimental investigation of its
reliability and validity is needed.

(v) Variability of Opinions

76. Harper (1927), Lentz (1930, 1934), and Telford (1934) have 
studied individual differences in the numbers of altered responses 
when the same opinionaire is given a second time after a three or 
four weeks' interval. With a long test these change scores are reli- 
able, but, just as with extremeness, the scores derived from different 
tests are rather inconsistent. Though there appears to be no corres- 
pondence between variability and general instability of personality, 
there is a slight negative relation with intelligence (the more intelli- 
gent being more consistent), and, according to Harper, with liberal 
opinions. 


IV.—ASSESSMENT OF HUMAN TRAITS BY RATINGS

A. INTRODUCTION

77. In everyday life we are continually making judgments of 
one another’s character and temperament traits. Not merely in 
ordinary social intercourse, but also in education, industry and the 
professions, we realize that these traits are fully as important as 
intellectual or other abilities. But our judgments are often haphazard
and biased, based on quite inadequate knowledge of the people we
judge. The various rating methods have developed in an attempt
to render them more objective, more systematic, and therefore
more useful both for practical and for research purposes. No
accurate and easily applicable tests are available for the assessment 
of most personality traits, so that we are forced to rely very largely 
on ratings. And ratings actually possess a considerable advantage 
over such personality tests as have been devized, in that they can 
often be applied without the knowledge of the ratees (i.e. the persons
rated), whereas the person who knows that he is being tested can 
hardly be expected to exhibit his normal emotional characteristics. 
It is probable then that, next to intelligence tests, ratings are more 
widely used and have been subjected to more thorough study than 
any other psychometric technique. 

78. Uses of ratings.—In many American business concerns, 
colleges and schools, it has become a matter of routine to check up 
on the personality characteristics of the personnel, students and 
children by periodic ratings. Though less developed in Britain,
we may note their habitual employment in the National Institute of
Industrial Psychology's vocational guidance service, and their
adoption in the important researches of Galton on imagery, Pearson
on intelligence, Burt on delinquency, and of Webb, [...]son and
others on character. The ordinary school report about a [...]



79. [...]s.—Ratings of personality traits and responses to an attitude
test are closely similar, not only in the technical methods they
employ, but also in the psychological processes they involve.
Rater A's judgments of B's personality are expressions of his
attitude [...] way [...] his [...] of items in an opinionaire express
[...] radical or other attitude. Generally speaking [...] as a source
of information about B (the ratee) rather than about A (the rater),
whi[...] test is applied mainly for the information it gives us about
A. Nevertheless, as we shall see below, it is important to keep in
mind this dependence of the assessments on A's affective judgments
as well as on B's characteristics.


B. RATING TECHNIQUES 


80. [...]ns.—Ranking or paired com[...] of ratees is small, in just the
[...]. Very careful definition of the trait, according to the principles
to be [...] below (§ 89), is essential if these techniques are [...]

When a number of individuals are to be rated on several traits,
and when none of the available raters is acquainted with the whole
group of individuals, an alternative procedure is recommended by
Burt (1937b), namely, that the raters should rank the order of
prominence of all the traits in each individual, instead of ranking or
rating all the individuals on each trait. He claims to have obtained
more reliable ratings by the former than by the latter method (cf.
Burt et al. (1926)). “Most people find it far easier to think of a
person’s character as consisting in a pattern of tendencies differing
in relative strength rather than in a sum of isolated traits to be 
assessed on a normal curve." The present writer has also found this 
method useful, but it does not seem to have been adopted elsewhere. 

81. Voting Techniques.—The voting technique (cf. § 2) is exemplified
by “Check List” and “Guess Who” ratings. In the former a
long list of traits is provided and the rater checks those which appear
to him to fit the testee. This is too crude a method unless there are 
very large numbers of raters; it has, however, been extended and 
applied in May and Hartshorne’s (1930), Thurstone’s (1934), and 
other investigations. May and Hartshorne’s (1930) “Guess Who”
technique is especially suitable for use with school children. A series
of short character sketches is drawn up, e.g. word pictures of a very
selfish, a moderately selfish, an average and an unselfish child. These 
are given to all the members of the class, who are told to fit them to 
any pupils they seem to describe, i.e. to guess whom they represent. 
Owing to the large number of raters, a pupil’s score for selfishness 
is readily obtained from the number of times each sketch is assigned 
to him ; and the scores show high reliability. 

82. Numerical ratings.—Much more frequently adopted are
techniques which regard a trait as a quantitative variable, and
assign to each ratee a certain low or high score on this variable.
Giving marks out of 20 to one's friends for Beauty, Humour, etc.
used to be a popular parlour game. We know now, however, that
the average rater cannot properly discriminate twenty grades of a
[...] scale (cf. § 90) is used; when the raters are numerous, or when they
are insufficiently interested in the rating or insufficiently trained to
make finer discriminations.

83. Errors in numerical ratings.—We have already dealt (§§ 3-5)
with the distribution of ratings, with the correction of errors in their
average and their dispersion, and with the methods of combining
the judgments of several independent raters. Most raters are with
difficulty persuaded to use the extreme steps in their judgments of
people; some are more cautious than others—hence the desirability
of specifying the proportions of ratees to be assigned to each step.
The error in average level is also very prominent owing to an
inveterate tendency to leniency amongst almost all raters. Unless specially
trained, they put far too many ratees above the average on any
desirable trait, too few below, apparently regarding an average or 0 
rating as something discreditable. On account of these various 
difficulties which raters find in using a numerical scale consistently, 
two alternative techniques have been devized—the Man to Man 
Scale and the graphic scale—the second of which is now almost 
universally employed. 

84. Man to Man scales.—This type of scale was developed by
Scott (1923) in 1917 for use among American army officers. In
rating, say, leadership, each rater was told to think of the officer A
whom he regarded as highest in this trait, then another E who was
lowest, another C half-way between, and others, B, D, half-way
between A-C and C-E. These five names were retained as a private
yardstick. In rating any officer X, the rater had to judge which of
the five X most closely resembled in leadership. Similar yardsticks
were constructed by each rater for several different traits. The
method is concrete and provides the rater with a fairly permanent
set of standards. But it is somewhat cumbrous, and has the
disadvantage that each rater's standards are different.

85. Graphic scales.—This type, first suggested by Freyd (1923),
attempts to establish the same standards for all the raters, by
careful definition of every step on the scale. The following is an
example from a scale to be used by college tutors in rating their
students.

Does X need constant prodding, or does he go ahead with his work without
being told?

|      |      |      |      |      |      |      |      |      |      |
Needs much      Needs           Does ordin-     Completes       Seeks and
prodding in     occasional      ary assign-     suggested       sets for
doing ordin-    prodding        ments of his    supplement-     himself
ary assign-                     own accord      ary work        additional
ments                                                           tasks

The graphic scale i[...]g for untrained raters. [...] quantitatively; he
simply [...] experimenter can measure off the position of [...] found
that five inches is the [...]ne, since it obviates crowding of the
descriptions, and yet can readi[...] whole. By an appropriate choice
of descriptions, the “central tendency” (tendency not [...]) [...]
leniency can be partially controlled (cf. Symonds (1931), Guilford [...]

86. Two further precautions may assist in securing more careful
judgments from the raters. They should rate all the ratees on one
trait at a time, not rate each ratee in turn on all the traits. And the
desirable and undesirable ends of the scales should be placed at
random on the left- and right-hand sides of the page in successive
traits. Uhrbrock (1932) even recommends “scrambling” the
descriptions of the various steps in a graphic scale, for example:—

   Very good
   [...]
   Excellent
   [...]
   Good

so as to force the raters to read all the descriptions before they
decide on their rating. This plan is however not generally adopted
except in the following, Thurstone-type, scales.


87. Rating scales in equivalent units.—As with attitude tests, the 
numerical ratings obtained by the above techniques are not scaled 
in equivalent units. We usually assume that the distances between 
successive steps are equal, though we have no justification for so 
doing. Given sufficient raters and ratees it is possible to establish 
a rational scale by Thurstone's methods (§§ 11-18, 45-48). For
instance Richardson and Kuder (1933) constructed a scale of SF 
statements for rating the efficiency of Proctor and Gamble's salesmen, 
a scale which they claim to be highly reliable. The following are 
sample statements, with values on a scale of 0 to 8 units. 

(6:9) He is making exceptional progress 

(3-2) He is somewhat in a rut on some of his brand talks 

(5:6) He tends to keep comfortably ahead of his work schedule ^ 
Any of the statements that are thought to apply to the ratee are 
checked, and their average scale value gives his efficiency score, in 
scientific units. Willoughby's Emotional Maturity scale (cf. $ 187) 
is similarly standardized. Most investigators, however, seem to 
regard such refinements as unnecessary. 
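
The scoring rule just described (average the scale values of the statements checked) can be sketched in a few lines. This is only an illustration: the three scale values are those of the sample statements quoted above, while the function name and the example ratee are hypothetical and are not taken from Richardson and Kuder.

```python
from statistics import mean

# Scale values of the three sample statements quoted above (0 to 8 units).
SCALE_VALUES = {
    "He is making exceptional progress": 6.9,
    "He is somewhat in a rut on some of his brand talks": 3.2,
    "He tends to keep comfortably ahead of his work schedule": 5.6,
}


def efficiency_score(checked):
    """Return the mean scale value of the statements checked for one ratee."""
    if not checked:
        raise ValueError("no statements were checked for this ratee")
    return mean(SCALE_VALUES[statement] for statement in checked)


# Hypothetical ratee for whom the first and third statements were checked.
score = efficiency_score([
    "He is making exceptional progress",
    "He tends to keep comfortably ahead of his work schedule",
])
print(round(score, 2))
```

A ratee for whom only unfavourable statements are checked would receive a correspondingly low average, so the score depends entirely on the scale values established beforehand.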

88. Social maturity scales.—Scaling according to age level is 
used by Bridges (1931) in her rating scales for emotional and social 
development of pre-school children, and by Doll in his recent Vineland
Social Maturity Scale (1936a, b, 1937). (It should be pointed out
that psychological age units are not equivalent.) The Vineland
Scale contains 117 items, such as :—


Reaches for familiar persons 
Dries own hands 

Is trusted with money 
Makes telephone calls 
Provides for the future 


It has been proved that the first item represents a level of maturity
attained by normal (American) 4 month old babies ; the second by
2½ year old children ; and the third, fourth and fifth by 5½, 10½ and
25 years respectively. While the scale can be used for any investigation
involving assessments of social competence, it was first
devised for the purpose of grading feeble-minded persons who, as is
well known, often differ more from normal persons in social than in
intellectual maturity. Though the Stanford-Binet test is invaluable
for measuring the mental age (M.A.) of defectives, it needs to be
supplemented by assessments of their emotional and social development.
The scale is applied by a trained examiner, who obtains the
necessary data from a competent informant in an interview ; he
checks the items in the scale that apply to the patient and then
works out the Social Age (S.A.) and S.Q. in the same way as a
Stanford-Binet M.A. and I.Q. Excellent reliability is claimed for
the scale (cf. § 284).
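
As a rough illustration of the last step, the sketch below assumes that the S.Q. is formed like the classical ratio I.Q., namely 100 × S.A. / chronological age. The text says only that it is worked out "in the same way", so the formula, the function name and the example figures are assumptions made for illustration.

```python
def social_quotient(social_age_years, chronological_age_years):
    """Ratio quotient 100 * S.A. / C.A., assumed by analogy with the ratio I.Q."""
    if chronological_age_years <= 0:
        raise ValueError("chronological age must be positive")
    return 100.0 * social_age_years / chronological_age_years


# Hypothetical patient: social age 6 years at a chronological age of 8 years.
sq = social_quotient(6.0, 8.0)
print(sq)  # 75.0
```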


C. TRAITS TO BE RATED


89. Avoidance of ambiguous traits.—In the early days of rating 
it was common to find very general traits such as Leadership, 
Efficiency, Honesty, etc., being assessed. The defect of these is that 
they require so much interpretation on the part of the rater; 
different raters may attach a variety of different meanings to them. 
Hence such scales tend to show poor reliability, or lack of agreement 
between raters. Numerical or letter ratings, or adjectives such as
“very high, superior, poor,” etc., are also liable to produce confusion
and variation of standards. American textbooks always insist
therefore that the traits to be rated and the steps on the scale should
be as specific and objective as possible. Each set of ratings should
deal with a single type of behaviour which has been actually observed
by the raters and which can be assessed reliably. Adequate definition
of a trait or type of behaviour is not achieved merely by listing
several synonyms, but rather by describing in easily understood
terms concrete situations in which the trait is expressed to various
degrees. And impersonal situations (where the ratee's trait in some
way affects the objective environment) are preferable to personal
ones (where he only creates an impression on his social environment).


According to Hollingworth (1929), some traits are rated worse than
others. Character and moral qualities such as Courage, Unselfishness
and Integrity are far too equivocal ; ... Quickness and the like are
relatively ... reliably. It is generally believed, ... all traits are
obtained by the use ...

90. Analytic scales.— ... efficiency, we split it into the ...
efficiency, and have these ... derived originally from ...
observations of items of behaviour; and by using
analytic scales his general impression is dissected so as to get at the
facts on which it was based. ... the items of an analytic rating scale are
eventually recombined to give a total mark for the general trait
and they may be weighted according to their relative importance as
components of the trait. ... statistically consistent with ...
independent of one another. ... the whole field covered by ...

91. Limitations of Behavioristic approach to ratings.—The
emphasis on Behaviorist principles in ratings, which is reflected in 
the last two paragraphs, is perhaps exaggerated among American 
psychologists. Certainly it is desirable to aim at concrete descrip- 
tions of the steps on our scales, but we cannot hope to eliminate all 
interpretation. There is, indeed, no definite dividing line between 
subjective interpretation and objective observation; the rater's 
recollections of the ratee’s actions will always be coloured by his 
general emotional impressions, and his notions of the ratee's general 
traits will always be mixed up with memories of particular actions. 
In the following discussion we will attempt to give the evidence 
for this standpoint, and to show its bearings on the practical
problems of rating scale construction.

First it should be pointed out that Behaviorist definitions are
very seldom achieved. The graphic scale for “Working without
prodding,” quoted in § 85, is indeed fairly objective ; but many
scales include highly interpretative items. The following, from
Laird's (1925) Personal Inventory C 3 for introversion-extraversion,
is an example :—

Has X (in the past few months) experienced fine sentiments and emotions ?

    easily moved to tears    sympathised readily    not especially sensitive    often unmoved    unsympathetic


92. Rating of observed or inferred behaviour.—Let us consider 
the claim that ratings will be useless unless the raters have had 
opportunities of observing and collecting relevant evidence about the 
type of behaviour to be rated. For instance, a teacher should not 
be asked to rate a child’s Physical Agility, nor his parents to rate his 
Concentration of Attention. This principle appears to be sound, and 
yet it neglects the tendency of human beings to express many 
facets of their personalities in everything they do, and the tendency 
of judges to interpret the person as a whole rather than merely to 
observe his specific behaviour. Newcomb (1931) has demonstrated 
that ratings on observed items of behaviour are but little superior 
to ratings on behaviour which has only been inferred. In his investi- 
gation at a boys’ summer camp, careful records were kept of the 
actual behaviour, and the accuracy of subsequent ratings was 
checked against these records. The accuracy of ratings on observed
items was represented by correlations of + 0.54 ± 0.09 and + 0.45
± 0.10, on unobserved items + 0.39 ± 0.11 and + 0.40 ± 0.10,
figures which are very little lower.

93. Reliability of different types of ratings.—The criterion most
frequently applied for determining the goodness of ratings is the
reliability (consistency), i.e. the extent of correlation between different
raters’ judgments of the same traits. Actually this criterion
does not always show conclusive superiority for analytic scales over
ratings of single general traits. In several studies reported in the
literature the consistency of analytic scales was represented by
correlations of about + 0.50 or less; though some more recent
scales have yielded better results, namely, coefficients of + 0.70 or
more. In early researches by Burt (1915), ratings were obtained
on McDougall's list of primary emotions; also systematic records
were compiled of specific manifestations of these emotions. The
consistency of the former averaged about + 0.45 to + 0.50; of the
latter + 0.60 to + 0.70. Newcomb’s investigation, mentioned
above, showed much the same consistency among ratings of observed
and of inferred behaviour. However, the significance of this criterion
is far from clear, since a high correlation may merely show that the
raters happen to have common prejudices about the ratees (cf. § 106),
not that their judgments are true ones. 
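
To make the consistency criterion concrete, here is a minimal sketch of the correlation between two raters' judgments of the same ratees on one trait. The product-moment formula is the standard one; the two sets of ratings are invented purely for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("ratings must vary for a correlation to be defined")
    return sxy / sqrt(sxx * syy)


# Hypothetical ratings (1 = low, 5 = high) of six ratees by two raters.
rater_a = [5, 3, 4, 2, 1, 4]
rater_b = [4, 3, 5, 2, 2, 3]

consistency = pearson_r(rater_a, rater_b)
print(round(consistency, 2))
```

As the paragraph above warns, a high coefficient of this kind shows only that the two raters agree with one another; it would come out just as high if they shared the same prejudice.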

94. Wolf and Murray’s investigation of ratings.—Some relevant 
evidence was obtained in a recent investigation by Wolf and Murray 
(1936). They made an extremely thorough study of the personalities
of 28 students. Each student was first rated on forty traits
by five judges after a 45 minutes’ interview, then tested or analyzed
by various investigators for a total of 36 hours, and finally re-rated
on the same traits in the light of all the information which had
accumulated. Taking these final ratings as criteria, the initial
ratings showed the fairly high average validity of + 0.63. But it
was found that those traits on which the initial raters most closely
coincided were not necessarily the most validly rated, though there
... between consistency and validity. Next, ...
whole, however, tr... which were best ...

... child's behaviour consisted of solitary play, onlooker behaviour,
organized group play, or other defined categories of sociality. A 
child’s score was derived from the total number of minutes (out of 
60) in which he had been engaged in each category. This and other 
similar studies generally discover remarkably high reliability ; 
a child’s social participation score on odd-numbered days correlates 
closely with his score on even-numbered days. Olson (1929) has 
applied the method to older children at school, recording their 
nervous habits such as nail-biting or tics, and their whispering, 
during specified time samples. Thomas, Loomis and Arrington 
(1932) have carried out extensive observations in a garment factory, 
recording the activities of an operative at 5 second intervals con- 
tinuously for 5 minutes. By confining themselves, however, to a 
Behaviorist classification of activities (“ Material, Self, Contact with 
Persons, Non-job Activities, and Language”), they seem to have
obtained results of purely technical, not psychological or practical,
interest. Nevertheless, time-sampling might be profitably extended
to industrial investigations of work concentration, talkativeness,
posture, etc.* In addition to its accuracy, it has the merit of
measuring spontaneous and natural behaviour in the kindergarten,
school or factory environment, without any of the artificial restrictions
that are so often unavoidable in psychological tests or laboratory
experiments. The testees need not even know that records are
being taken if suitable one-way observation screens are used.
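
The scoring and the odd-even reliability check described in this paragraph can be sketched as follows. The category labels, the children and all the figures are hypothetical; only the procedure (count the minutes per category, total odd- and even-numbered days, correlate the two totals across children) follows the text.

```python
from collections import Counter
from math import sqrt


def category_minutes(observations):
    """Count the observed minutes falling in each behaviour category."""
    return Counter(observations)


def odd_even_totals(daily_scores):
    """Split one child's daily scores (day 1 first) into odd- and even-day totals."""
    odd = sum(daily_scores[0::2])   # days 1, 3, 5, ...
    even = sum(daily_scores[1::2])  # days 2, 4, 6, ...
    return odd, even


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# One hypothetical observation period, one category label per minute.
minutes = category_minutes(["solitary", "group", "group", "onlooker", "group"])

# Hypothetical minutes of organized group play per day for four children.
group_play = {
    "child_a": [10, 12, 9, 11],
    "child_b": [3, 4, 2, 5],
    "child_c": [7, 6, 8, 7],
    "child_d": [1, 2, 1, 1],
}

totals = [odd_even_totals(days) for days in group_play.values()]
odd_scores = [odd for odd, _ in totals]
even_scores = [even for _, even in totals]
reliability = pearson_r(odd_scores, even_scores)
print(minutes["group"], round(reliability, 2))
```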

96. Reliability (consistency) of observational records.—Our imme- 
diate concern with these techniques lies in their consistency. In 
general the correlation between two observers who make simul- 
taneous records is decidedly higher than the consistency of ordinary 
ratings. Moreover, Thomas claims that observations of highly 
specific objective activities (e.g. total number of contacts with other 
children) are more consistent than those of activities which involve 
interpretation (e.g. number of social contacts). But such specific
fragments of behaviour have very little relevance for personality
until they are interpreted. Total contacts may be very accurately
measured, but they are psychologically meaningless. And Good- 
enough (1928, 1930) shows that social participation, leadership and 
the like can be accurately recorded if they are sufficiently clearly 
defined, and if the recorders are sufficiently trained. 

97. Conclusion.—These various researches, together with others 
described below, appear to lead to the conclusion that ratings will 
be somewhat improved by careful and concrete definitions of the 
traits or types of behaviour, and that it is probably worth while 
substituting for each general trait a short series of relatively objective
component scales. But unless we confine ourselves to techniques
such as Thomas’s, we cannot hope to get strictly factual ratings,
freed from all subjective interpretation. Thus a very detailed
Behavioristic dissection of a trait into an analytic scale of many
items is unnecessary, and does not confer any advantage to compensate
for the great increase in labour which it demands from the
raters.

* Such techniques are, in effect, already employed by industrial psychologists
in studying output and accidents. The discovery of Farmer and others
that records of accidents sustained during a sample period of a year are highly
predictive of subsequent accident rates, is a good instance.


D. INFLUENCE ON RATINGS OF EXTENT OF
ACQUAINTANCESHIP


98. Ratings based on photographs or the voice.—Many studies 
have been made of the accuracy of ratings when the amount of 
information concerning the ratees is severely limited. For instance it
is important to know how much can be deduced from facial expression
alone, as portrayed in a photograph, how much from a brief interview,
and so on. The standard method is to obtain ratings on several
traits from a group of judges who have interviewed, or looked at
photographs of, the ratees, and then to correlate their judgments
with ratings on the same traits obtained from close acquaintances.
Alternatively intelligence may be rated and the judgments compared
with intelligence test results, or judgments of vocation compared with
actual vocation. Similar judgments of traits, vocations, etc., may
also be based on hearing the voices of the ratees. The results with
photographs or the voice alone are uniformly poor (cf. Pintner (1918) ;
Landis and Phelps (1928); Hollingworth (1929); Taylor (1934),
and a dozen other studies). Judgments of intelligence or of character
traits tend to give zero correlations, unless the ratees are a specially
selected group. Burt (1919) shows however that emotional and
physical traits, such as Humour and Muscular Energy, are slightly
better rated.


... they are forced ... to compress the... or the voice ... their
deductions from the photographs ... Burt's method of ranking traits
within the individual
ratee (§ 80) and May's “Guess Who” technique (§ 81) are two attempts
to substitute more natural forms of judgment, which appear to possess
advantages over the ordinary rating techniques. A third possibility
is known as the matching technique. 

100. The matching technique.—Here, for example, the judges are
given brief sketches of the personalities of the individual ratees, and
instructed to identify or match these with the appropriate photographs
or voices. Several continental psychologists—Binet, Arnheim,
and Wolff (whose work is summarized in the writer's article, 1936)—
obtained high proportions of successful matchings by this technique.
The writer has devised a statistical method for expressing the
validity of such results in terms of contingency coefficients, the
application of which indicates that judgments of personalities
considered as integrated wholes are indeed superior to judgments
of separate traits in a group of persons. Thus in Allport and
Cantril’s (1934) experiments on the voice, judgments of separate
traits or interests yielded an average validity contingency of + 0.27;
but when the same lists of traits were combined into brief character
sketches, the matching of these with the same voices gave a
contingency of + 0.41 (the probable errors of these coefficients
were approximately 0.01 to 0.02, so that the superiority is certainly
significant). The matching technique is most suitable for the study
of somewhat tenuous indices of personality, such as photographs, 
the voice, gestures, and artistic style. Its extension to judgments 
of more complex material has been but little explored. 
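
In judging whether a proportion of successful matchings is "high", it helps to know what chance alone would yield. The sketch below is not the writer's contingency method, which is not reproduced here; it merely computes, by the standard combinatorial argument, the probability of a given number of correct matchings when n sketches are paired with n photographs at random. The function names are illustrative.

```python
from math import comb, factorial


def derangements(n):
    """Number of permutations of n items with no item in its own place."""
    d_prev, d_curr = 1, 0  # D(0) = 1, D(1) = 0
    if n == 0:
        return d_prev
    for k in range(2, n + 1):
        d_prev, d_curr = d_curr, (k - 1) * (d_curr + d_prev)
    return d_curr


def p_exactly(n, k):
    """Chance probability of exactly k correct matchings out of n."""
    return comb(n, k) * derangements(n - k) / factorial(n)


def p_at_least(n, k):
    """Chance probability of k or more correct matchings out of n."""
    return sum(p_exactly(n, j) for j in range(k, n + 1))


# With 5 sketches and 5 photographs, how often would a judge who is
# merely guessing get 3 or more right?
chance = p_at_least(5, 3)
expected_correct = sum(k * p_exactly(5, k) for k in range(6))
print(round(chance, 4), round(expected_correct, 4))
```

Whatever the number of items, a guessing judge averages one correct matching, so a series of trials averaging well above one is what calls for explanation.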

101. Ratings of dynamic appearance : interviews.—The living 
and moving features naturally offer much more scope for judgments 
than do photographs. Cleeton and Knight (1924) found low positive 
correlations between judgments of students sitting, silent, on a 
platform and criterion ratings supplied by their acquaintances. 
Estes (1937) obtained similar results from ratings of motion pictures, 
and these were somewhat improved when the method of matching 
with character sketches was adopted. 

In interviews, many additional deductions may be made from the 
ratee's answers to questions ; hence we usually, though not invariably, 
find better validity. Wolf and Murray's result has already been 
quoted (§ 94). Burt (1919) obtained coefficients averaging + 0.33
between ratings of children after brief interviews and their teachers’
ratings on the same traits. This figure should be contrasted with
the average of + 0.12 obtained from ratings of photographs.
Magson (1926) found scarcely any agreement between judgments
from interviews and tests or ratings of intelligence, but ascribed
this partly to the diversity of interpretations of the meaning of the
term intelligence. Hollingworth (1929) has supplied a devastating
demonstration of the inability of experienced business men to agree
upon the applicant who would be most suitable for a particular
job. Twelve experienced judges interviewed each of 57 candidates
and arrived at entirely discordant conclusions. Webb (1915), on the
other hand, found as high agreement between ratings by interviewers
and ratings by teachers who knew the ratees well, as between the
interviewers among themselves or the teachers among themselves.

102. ... temperament and personality on the basis of ... testing
and interviewing.—Still more data are
available when the ratees are given something to do during the 
interview. If they are set some baffling task, their persistence, 
reactions to success and failure, etc., may be very revealing. This
is the situation in the individual application of the Binet-Simon,
or performance, tests of intelligence. Burt (1926), Stutsman (1931),
Anderson (1929), the present writer (1929) and others have described
the many qualities that may be observed under these conditions.
The writer has prepared a graphic rating sheet, listing these qualities,
which he has found useful in interviews and individual tests with
adolescents and adults ; it is reproduced on pp. 56-57.

Burt and the writer find good consistency between different
observers who judge the same testees under these conditions ;
and the writer has obtained an average validity coefficient of + 0.50
by comparing observers’ ratings on several traits with a series of
independent measures of these traits. Though this figure is only
moderately high it compares well with the validity of ordinary
ratings by associates (cf. § 110). A much higher figure can hardly
be expected, since the testing situation is necessarily different from,
and more restricted than, situations of everyday life. Many testees
may be upset by its novelty and fail, for instance, to persevere at
the tasks set them, although usually very persistent at things which
interest them elsewhere. 

103. Ratings in vocational guidance interviews.— ... guidance
techni... an informal disc...

104. Ratings by acquaintances.—Ordinary ratings are not
usually based on any immediate observations, but rather on the 
raters’ recollections of all their past experience of the ratees. Often 
they will have seen the ratees in much more varied situations and 
so have a much wider range of data to draw on than in any of the 
investigations we have been describing. There is still of course 
some restriction; for acquaintanceship is generally confined to 
one phase of people’s lives. For instance, the foreman has little 
opportunity to observe the factory worker’s home life, and the relatives
do not know how he behaves at the factory. We have, however,
already pointed out the weakness of the view that raters can rate
accurately what they have actually observed, and nothing else
(§ 92). Subjective generalization and interpretation probably
play a large part even in estimates of objective characteristics of 
the ratee’s behaviour. We tend to see people and to note their 
actions through the distorting spectacles of our own affective 
reactions to them. In an interesting study by Landis (1925) of the 
reasons which raters give for their judgments, it was found that the 
ordinary rating is not based on definite past observations, but rather 
on a general impression of a diversity of experiences. The reasons 
were usually very vague, often bizarre ; they might well be termed 
rationalizations, in the psychoanalytic sense. 

105. Conclusions as to the effects of acquaintanceship.—We need 
no longer be surprised that prolonged observation and close acquain- 
tanceship do not necessarily improve ratings. The disagreements 
between different acquaintances rating the same individuals are 
very striking. According to Symonds (1931) the typical correlation 
between the judgments of a pair of raters is about + 0.55. Though
higher figures are sometimes obtained, they may often be lower if
the raters are untrained, the scale badly prepared, and the traits
obscure. The preceding sections do indeed indicate that an increase
in the amount of data concerning the ratees generally increases
the accuracy of ratings up to a point. But the results of Landis (1925) and
Shen (1925) show that intimacy of friendship has no effect, and
Knight (1923) found evidence of poorer judgments of ratees who had
been longest known to the raters. Apparently a relatively superficial
and detached knowledge may be superior because prolonged familiarity
tends to make the raters stereotyped and biased in their
attitudes. It is sometimes recommended that raters should carefully
study the ratees for a period before they make their judgments,
keeping in mind the traits to be rated. (This plan was adopted
by Webb (1915).) Though certainly desirable in so far as it can be
carried out with impartiality, yet we must be prepared for the
resulting increase in knowledge to be offset by the development in
the raters of greater prejudice.

GENERAL RATING SCALE FOR QUALITATIVE OBSERVATIONS DURING
TESTING AND INTERVIEWING

Name ............ Date ............ Examiner ............

ACTIVITY
Excited, restless, unable to keep still    Impulsive
Quick and vivacious    Stable
Calm and deliberate    Cautious
Inert and listless    Inhibited

Poses, motor attitudes
Tics .............
Fiddling with material
Peculiar expressions

MOVEMENT
Fluent and graceful
Accurate and well-controlled
Quick stride and movements
Slow stride and movements
Angular and awkward
Clumsy

PHYSIQUE AND BEARING
Impressive in bearing
Satisfactory impression
Unimpressive
Healthy looking, well developed and nourished
Unhealthy, feeble physique
Forceful, efficient, energetic, upright posture and gait
Slouching gait
Weak, inefficient movements and bearing
Plump (pyknic) proportions    Florid
Well and symmetrically proportioned
Thin (asthenic)    Pale

PERSONAL APPEARANCE AND EXPRESSION
Attractive and good-looking (positive reaction)
Pleasant    Sensual
Uninteresting, indifferent attractiveness
Ugly and repulsive (negative reaction)
Strong expressiveness of face and gestures
Frank
Expressionless
Quick and strong sense of humour
Slow but sure
Unable to see humour
Mature, serious, philosophical
Immature, childish

SPECIAL CHARACTERISTICS
Secretive
Cheerful, optimistic
Depressed, melancholy
Excitable, irritable
Even-tempered
Calm, phlegmatic

SPEECH
Voice resonant, pleasing, well-modulated    Clear, fluent, distinct
Hard, harsh, pinched    Stutters, stammers
Expresses meaning directly, grammatically, with facility
Unable to express himself, ungrammatical
Garrulous, over-talkative    Brilliant in talking, wide vocabulary
Rather voluble    Dull and stolid, narrow vocabulary
Seldom speaks of own accord
Reticent, taciturn

PERSONAL CARE
Fastidious in dress, over-manicured
... taste, neat and clean
Passable and inconspicuous
Careless in dress and cleanliness
Slovenly and unkempt

SELF-ASSERTION
Pompous and overbearing
Complacent    Decisive
Self-confident and possessed
Wavering
Self-critical and deprecatory
Embarrassed, bashful, self-conscious    Contrasuggestible
Anxious, apprehensive
Submissive, retiring    Suggestible

CO-OPERATIVENESS
Willing to co-operate in every respect ; enters into spirit
Reserved and formal
Constrained and suspicious, outside the situation
Surly and hostile
Scrupulous, punctual and regular in attendance, application
Industrious
Easy-going, indifferent
Lazy and irregular

ALERTNESS AND CONCENTRATION
Intelligently attentive, wide-awake
Concentrated
Absent-minded
Easily distracted, inattentive

TEST REACTIONS: PLANNING
Analytical
Serious but unsystematic
Trial and error    Profits by past experience
Haphazard
Repeats same mistakes

EMOTION
Wild and unrestrained emotional behaviour and remarks
Wilful and childish reactions, capricious
Some loss of self-control, and overt emotion
Humorous and unconcerned
Serious, philosophical
Repressed and inhibited


E. HALO EFFECT, AND THE RELIABILITY AND VALIDITY OF
RATINGS

106. Halo effect.—Early in the evolution of rating methods it 
was found that sets of ratings on different traits when inter-correlated 
almost always yield unduly high coefficients. A pair of traits 
which appear to have no a priori connection with one another are 
yet discovered to possess some prominent common factor. For 
instance, in an investigation by the writer, ratings of students by 
their friends on Sociability and Quickness of Movement gave a correlation
of + 0.71. Thorndike (1920) suggested that the common
factor is the general impression of, or attitude towards, the ratee
possessed by the raters, which colours all their judgments of particular
traits. For instance, we regard A as highly artistic, and tend to 
attribute to him all the other traits commonly associated with the 
artistic temperament ; or we think of B as generally melancholy, 
and fail to notice such of his characteristics as do not fit in with 
our over-simplified notion. Most commonly, it would seem, halo 
consists largely of our general liking for, or our dislike of the ratees. 
For it is usually found that the desirable or admirable traits give 
high positive inter-correlations, and negative correlations with 
undesirable traits. Doubtless this has some basis in actual fact ; 
persons of fine character do tend to be high on all good qualities, 
others do tend to be weak all round. But we are very liable to 
exaggerate this, and to attribute unwittingly all the virtues to our
friends, all the vices to our enemies. Hence we fail to distinguish
properly between traits which should be relatively discrete.
Uhrbrock (1932) has shown that raters who are personally selected
by the ratees themselves are especially apt to over-rate on all
desirable qualities.

Many intelligent and conscientious raters no doubt realize the
existence of this effect and try to allow for it, to dissociate their
personal prejudices from their judgments. Yet even when fore...
impossible for raters to avoid it altogether.

... interpreter and evaluator.” It is found for instance that different
raters agree more closely with one another when they possess a
common slant towards the ratees. A schoolchild tends to possess
much the same personality in the eyes of two of his teachers ; his
parents may also look on him alike, but the agreement between parents 
and teachers may be relatively small. This is partly due of course 
to the different environments in which teachers and parents see him, 
and his different behaviour in these environments, but it also reflects 
his different haloes. May and Hartshorne (1930) show that it is 
better to regard ratings primarily as the reputation of the ratees in the 
eyes of the particular set of raters. To do this need not imply that 
ratings tell us nothing of value about the ratees ; indeed we shall see
below that they do agree moderately closely with measures of
behaviour. But it does mean that we make allowance for the subjective
element, just as we are accustomed to do in everyday life.
We do not usually take one another's opinions about other people 
at their face value, but normally interpret them in the light of our 
knowledge of the judges. A factory worker can tell us something 
about the personality of his foreman, a mother about her son, though 
we realize that they are both looking through coloured glasses. Simi- 
larly then, ratings should be interpreted with full regard to their origins. 

108. Improvement of reliability and validity by pooling ratings.— 
Although the judgments of a single rater have very little worth as 
measures of an individual's traits, yet it is clear that the pooling 
of judgments from different raters will tend to cancel out some of the 
subjective errors and so lead to ratings of greater reliability and 
validity. It is generally considered desirable to combine at least
five raters ; though the number varies with their experience. It is
less often realized that raters should be chosen who have seen the
ratee from as many diverse viewpoints as possible. For instance,
the combination of ratings on a child from two teachers, two parents,
and two of his friends is likely to give a much better picture of him
than a pool of six teachers’ judgments ; biases which will be reduced
by the former plan will tend to reinforce one another in the latter.
In other words we should try to get a representative sample of the 
ratee’s various reputations. 
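
A minimal sketch of such pooling is given below, taking the pooled rating to be the simple average of the judgments of all the raters. The text does not prescribe a particular method of combination, so the averaging, the function name and the invented ratings are assumptions for illustration only.

```python
from statistics import mean


def pooled_ratings(ratings_by_rater):
    """Average each ratee's rating over all raters who rated that ratee."""
    collected = {}
    for ratings in ratings_by_rater.values():
        for ratee, value in ratings.items():
            collected.setdefault(ratee, []).append(value)
    return {ratee: mean(values) for ratee, values in collected.items()}


# Hypothetical ratings (1 = low, 5 = high) on one trait from raters who
# see the children from different viewpoints.
ratings = {
    "teacher_1": {"child_x": 4, "child_y": 2},
    "teacher_2": {"child_x": 5, "child_y": 2},
    "parent_1": {"child_x": 2, "child_y": 3},
    "parent_2": {"child_x": 3, "child_y": 3},
    "friend_1": {"child_x": 3, "child_y": 4},
    "friend_2": {"child_x": 4, "child_y": 4},
}

pooled = pooled_ratings(ratings)
print(pooled)
```

In this invented example the teachers alone would place child_x well above child_y, whereas the mixed pool brings the two children much closer together, which is the kind of correction of a one-sided reputation that the paragraph has in mind.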

109. It should be noted that this scheme will not bring about 
so high a reliability (consistency) coefficient as will the pooling of the 
estimates of similar raters, although it will tend to improve their 
validity. We have here the same phenomenon as with attitude 
scales (cf. § 56) ; a very high consistency may indicate uniform bias 
rather than approximation to the truth. Again, when a second 
set of ratings is obtained from the same raters, a high repeat relia- 
bility is desirable, but it should not be taken as a criterion of accuracy, 
since it could equally well result from fixity of prejudice. 

Reliability, it is generally claimed, may be improved not only 
by increasing the number of raters, but also by increasing the number 
of ratings through the use of analytic scales. We have already
dealt with this point (§§ 90-97), showing that some subdivision of
traits into concrete items is worth while. We also noted that the
different items should be (like the different raters) relatively
independent of one another.

110. Empirical evidence of validity.—There have been few
studies of the validity of ratings, in the sense of comparisons with
other more objective criteria of people’s traits. Much more often
they are assumed to be valid, and are used as criteria for determining
the validity of some test of aptitudes or of character. However,
the work of Hartshorne and May (1928), together with unpublished
work by the writer, where large numbers of tests as well as ratings
were applied to the same subjects, do tend to show that an ...
set of pooled ratings is superior to any single personality test or short
battery of tests. The writer obtained an average validity coefficient
of + 0.60, whereas most of the alleged personality tests yielded
figures around + 0.30 to + 0.45. It is obvious from the previous
discussion that we cannot expect measures of reputation and measures
of behaviour to cover precisely the same ground any more than an
attitude test can accurately predict a testee's actions.

111. Co-operation and training of raters.—There is a further
possible flaw in ratings, namely, that they are not necessarily
perfect measures of reputation. For just the same inhibitions,
hesitations, misunderstandings, etc., are likely to be aroused in the
rater when he is asked to assess an acquaintance on a rating scale,
as when he is asked to fill in a political or religious attitude scale
(cf. §s 57-58). Obviously, he will seldom give his candid opinions
if he has the slightest fear that the ratee will see these opinions.
And his willingness to put down what he believes to be the truth
may vary considerably with the social respectability of the trait;
e.g. he might admit that the ratee was extremely impulsive, but
shrink from attributing to him extreme conceitedness or dishonesty.
Clearly then it is desirable to take into account the attitude of the
raters towards the task and towards the experimenter, and to secure
as complete co-operation as possible before the rating begins.
Conrad (1932) recommends that the rater should be interested in,
and should realize the usefulness of, the ratings; also that he should
be allowed to take his own time over them. Kingsbury (1922)
supplies useful hints [...] to be employed should be [...]
[...]tions are clear. Instruction with regard to the main sources of
error, and practice in the application of the scale, should help to
counter the central tendency, the leniency tendency, and the halo
effect.


F. FACTOR ANALYSIS OF RATINGS
112. The inter-locking of ratings.—As early as 1906 Heymans
[and Wiersma] studied the overlapping between ratings on a large
number of traits. For instance, they selected from their ratees those
described as persistent liars and found [...]
to this group, how emotional, how [...]
were, and so on. Though their te[...]
[...]shadowed what is [...]
namely, the discov[...]
traits by factorial [...]
recently listed 17,593 personality traits. Naturally many of these
are synonymous, and still more are likely to overlap. For instance,
there might be a common element in “emotional, unstable, impulsive,
excitable, temperamental, variable,” and a dozen other such terms.
Factorizers hope then to distinguish a relatively small number of
discrete dimensions of personality into which these thousands of
traits may be resolved.
113. Attempts to remove the halo factor.—Now ratings provide
the easiest, and possibly the best single method, of getting measures
of traits. But we have seen that all sets of ratings start out with the
very influential common factor of halo, which is responsible for a
high degree of spurious overlapping. As Guilford (1936) states, until
this is removed or corrected, nothing can be deduced from trait
inter-correlations. No satisfactory method for removing it has
yet been devized, though several possibilities have been suggested.
(a) Either the sum of all the ratings on desirable traits, or the first general
factor to be extracted in analyzing the correlations should be regarded as
measuring halo, and should be held constant by the partial correlation
technique. This is scarcely fair because, as pointed out above (§ 106), the
positive overlapping of desirable traits is to some extent a real feature of the
personalities of the ratees. Nevertheless, in one experiment by the writer,
the validity of ratings (compared with other measures of the same traits)
was somewhat improved by this expedient.
(b) When factorizing measures of abilities we would expect the first,
g, factor to account for the major part of the inter-correlations. But with
measures of varied personality traits, the first few factors should presumably
be more nearly equal in size. If therefore, as in most of the studies described
below, the first factor is far larger than the others even after rotation of axes,
its excessive size might be made to yield an index of halo. The flaw in both
these suggestions is that they assume halo to be itself a unitary factor.
Actually it is likely to be much more complex than mere desirability of traits,
and so to enter more or less into all the factors extracted, though chiefly into
the first.
(c) Raters might be asked to estimate their personal liking for or dislike
of the ratees, in addition to assessing their personality traits, and these
popularity ratings might be partialled out from the trait ratings. For the
reason just mentioned, this would certainly not eliminate, though it might
reduce, halo effect.
(d) May and Hartshorne (1930), finding an average consistency of
+ 0.92 among teachers’ ratings of pupils and among the pupils’ ratings of
one another, but a correlation of only + 0.48 between teachers and pupils,
point out that the excess of the former over the latter figure is due to the halo
components in the ratings, and that the common ground between teacher
and pupil ratings (represented by the lower coefficient) may correspond to
the ratees' actual behaviour. They do not show, however, how the halo and
behaviour components can be effectively separated. More recently Chi
(1937) has attempted to implement the suggestion. He assumes that halo
is an individual attitude which is different in each rater, A, B, C, etc.
Thus the correlation between A's ratings of traits 1 and 2 ($r_{A_1A_2}$) will be
spuriously increased by halo; so will all other correlations of the same type
(e.g. $r_{A_1A_3}$ or $r_{B_1B_2}$). But the correlation between A's rating of trait 1
and B's rating of trait 2, and others of this type ($r_{A_1B_2}$, $r_{A_2B_1}$, etc.) will
not be affected. In a detailed investigation he has worked out by this means
all the trait inter-correlations freed from halo ($r_{A_1B_2}$, etc.) and all those
due to halo ($r_{A_1A_2} - r_{A_1B_2}$, etc.). Unfortunately his major premise cannot
be accepted; halo is not purely individual; it is only too likely to be common
to several raters, especially when, as in this instance, all the raters were
teachers. Either then the method must be extended, in accordance with
May and Hartshorne's original suggestion, to groups of raters; or else a number
of entirely independent raters, who are unlikely to possess any halo in common,
must be obtained. We might then arrive at a good measure of the extent of
halo and of the inter-trait correlations when it is removed. Until this is done,
the interpretation of the results of the following factor analysis studies
is very hazardous.
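
Two of the suggestions above reduce to simple arithmetic on correlation coefficients, and a minimal sketch may make them concrete. The first function is the standard first-order partial correlation, as would be used to hold a halo or popularity measure constant in (a) and (c). The second takes the excess of a same-source correlation over a cross-source correlation as the halo component, which is one reading of the argument in (d). Function names are hypothetical, and the figures 0.60 and 0.70 are invented; only 0.92 and 0.48 come from the text.

```python
import math


def partial_corr(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation between traits x and y with z (e.g. a halo or
    popularity measure) held constant: first-order partial correlation."""
    denom = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    if denom == 0.0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


def halo_excess(r_same_source: float, r_cross_source: float) -> float:
    """Excess of a same-source coefficient over a cross-source one,
    treated (as an assumption) as the share attributable to halo."""
    return r_same_source - r_cross_source


if __name__ == "__main__":
    # Invented figures: two traits correlate 0.60, each correlates 0.70
    # with the popularity rating that is to be partialled out.
    print(round(partial_corr(0.60, 0.70, 0.70), 3))
    # May and Hartshorne's coefficients as quoted in the text.
    print(round(halo_excess(0.92, 0.48), 2))
```

Neither calculation answers the objection raised in the text: both treat halo as a single removable quantity, and the second assumes that nothing of it is shared between the two sources.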


114. Webb's investigation.—The pioneer investigation in this
field was carried out by Webb (1915), using ratings of schoolboys
and students. By Spearman's technique, he extracted a factor
which seemed to represent general strength of character or will,
and called it w. It was most strongly loaded with traits such as
“conscientious, persistent, energetic, tactful, emotionally steady,
and kind on principle.” Webb could not have been aware at that
time of the pervasiveness of halo; but it is clear now that the traits
listed are precisely of the desirable kind which his raters would
be most likely to attribute to those whom they regarded through
a favourable halo. How far then w should be considered as a
fundamental dimension of the ratees’ personalities, and how far it
corresponds to their mere popularity, cannot be decided.


115. McDonough’s, Cattell’s and McCloy’s investigations.—[McDonough]
(1929) adopted Kelley’s extension of Spearman’s method in analyzing a series
of ratings of young children. She extracted four main [...]
to represent Will, Cheerfulness, Sociability [...] [Cattell]
(1933) studied the overlap be[...]
descriptions of extrovert-introvert, cyclothyme-schizothyme, and other such
[...] he called Surgency-Desurgency [...]
[...]ts, “cheerful, natural, sociable, [...]

[...] rotation of axes, fairly [...]
[...] The first, [...]
[...] desirable traits. The second [...]

I. Self-important, sarcastic, haughty, grasping, cynical, quick-tempered.
II. Friendly, congenial, broadminded, [...]
III. Patient, calm, faithful, earnest.
IV. Persevering, hard-working, systematic.
V. Capable, frank, self-reliant, [...]

117. Kelley's, Tryon's and Chi's investigations.—Kelley (1934) applied
Hotelling’s technique to the analysis of tests and ratings of children’s “courtesy,
fair play, honesty in school work, loyalty to fellows, mastery, poise, regard for
property rights, and school drive.” The two main components seemed to
correspond to general social conformity or good citizenship, and individualism
or assertiveness. Tryon (1933) obtained different main factors in the two
sexes from ratings by children of one another. The chief constituents of
the first factor in boys were “active, humorous, friendly, leader, fights,
daring, assured, happy, enthusiastic,” and in girls, “popular, happy,
enthusiastic, [...]”

Chi removed the individual halo element from teachers’ ratings of pupils
by the method described above (§ 113), and then found a prominent general
factor, similar to Webb’s, which he called volition in relation to the school
environment. Among the correlations due to halo, he found an independent
factor which was difficult to identify; it was most heavily loaded with traits
such as “persistence and facing reality,” least of all with “muscular
co-ordination.”

118. Burt's investigations.—Burt (1915, 1938) has studied the
inter-relations of assessments of primary emotional traits (e.g. fear,
rage, joy, etc.), both among normal students and children and among
delinquents and neurotics. He used not only ratings but also
records of observed behaviour, which were classified under the
headings of the various emotions (§ 93). The latter, while not
of course entirely immune from distortions of subjective judgment,
should contain much less halo than the ratings analyzed by other
investigators. The full results are not yet published; but apparently
they provide clear evidence of a general factor of emotionality
running through all the separate traits, and a secondary factor
named sthenic v. asthenic emotions, which corresponds roughly to
the conventional dichotomy of extroversion-introversion. The
same factors also emerged when correlations between persons (§ 68)
instead of between trait assessments were analyzed.

119. Conclusion.—This survey of the main factor studies in the
field of ratings shows some concordance, but also a good deal of
divergence between the results of different experimenters, who start
out from diverse lists of traits and use diverse statistical techniques.
Several of them may perhaps be reconciled if we take the point of
view that they are primarily analyses of raters’ opinions rather than
of ratees’ traits; that is, if we regard them as showing the main
ways in which different groups of people tend to think about
personality. The primary factors in Kelley’s and Chi’s studies show
which from among the traits presented to the raters are most
appreciated or depreciated by teachers. Tryon's results similarly
reflect the views of boys and girls, and McCloy’s and Thurstone’s
illuminate the psychological thinking of college students. At the
same time there is undoubtedly some overlapping between these
factors derived from ordinary ratings, and those yielded by Burt's
more objective data; there are resemblances also to the products
of the self-rating tests, described in the next chapter (§ 372). So that
though we cannot accept the claim that these studies have isolated
the fundamental components of personality or character, nor decide
at present how far they reveal distinctive types of behaviour (as
contrasted with reputation), yet they do provide an interesting and
promising method of attack upon problems of personality.


E. INDIRECT MEASURES DERIVED FROM RATINGS
(i) Judging Ability

120. If the judgments of a group of raters are studied, it is
always found that some agree much more closely than others with
the pooled results. Those whose opinions coincide best with the
opinions of the group are commonly said to be the “best judges of
personality,” or to show the most “insight.” But if we bear in
mind the halo effect, we see that “insight” might equally well be
interpreted as “conformity.” For goodness or badness of ratings is
measured in just the same way as conformity or atypicality of opinion
is measured in an attitude scale (cf. §s 71-73). Thus suppose that
rater A happens to possess a very complete and unbiased knowledge
of ratee X, but B, C, D and E have a uniform, yet biased, attitude
to X; A's ratings would then turn out to be the worst. Sometimes
a rater's assessments can be compared with a somewhat more
trustworthy criterion than ratings (cf. Vernon (1933a), Wolf and
Murray (1936)), but most investigations of judging ability have been
based on this rather dubious standard.
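
The point about rater A can be illustrated numerically. The sketch below extends the single-ratee example to six hypothetical ratees so that correlations can be computed; all the figures are invented, and the "true" values stand for a criterion that would not normally be available. Each rater's "judging ability" is scored, as in the studies criticized, by agreement with the pooled ratings.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical true standing of six ratees on some trait.
truth = [1, 2, 3, 4, 5, 6]

# Rater A is accurate; B to E share one biased view, with small variations.
ratings = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [3, 1, 4, 2, 6, 5],
    "C": [3, 2, 4, 1, 6, 5],
    "D": [2, 1, 4, 3, 6, 5],
    "E": [3, 1, 5, 2, 6, 4],
}

# Pooled result: mean rating given to each ratee by all five raters.
pool = [sum(col) / len(col) for col in zip(*ratings.values())]

# Agreement with the pool (the usual index of "insight") and
# agreement with the true values (the accuracy actually wanted).
agreement = {name: pearson(r, pool) for name, r in ratings.items()}
accuracy = {name: pearson(r, truth) for name, r in ratings.items()}

if __name__ == "__main__":
    for name in ratings:
        print(name, round(agreement[name], 2), round(accuracy[name], 2))
```

With these figures the accurate rater agrees least with the pool, so an index of conformity ranks the best judge last.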

121. Consistency and validity of judging ability.—Contrary to
expectations there is found to be extremely little consistency in the
ability. A judge may rate one trait well, another badly; or he may
rate X accurately and Y inaccurately. Either there is no such
thing as general intuitive ability, or else it cannot be measured by
this approach. However, there is fairly good evidence that in the
long run better judges are slightly superior in intelligence, in artistic
inclinations, and in introverted, asocial tendencies (cf. Adams
(1927); Vernon (1933a); Estes (1937)). [...]
that the extraverted, sociabl[...]
[...] viewing others impartially. In a study by Hollingworth
[...]elves better at rating it in others, and that
[...] (e.g. the most “snobbish” were bad at
rating “snobbishness”). Wolf and Murray (1936) have also shown
[...]

[...] judgments of a group of
ratees, we are actually [...]
(cf. § 68). Stephenson (1936b) also discusses this point, and
contributes a factorial study of rate[...]
from ten teachers’ judgments o[...]
determined [...] correlations [...]
factorized them. Two main types of raters were found. Those [...]
the first type agreed [...]
[...] the opposite held. [...]
second type agreed well among themselves, but did not correlate
with the first type. The first appeared to base their judgments of
Reliability chiefly on placid, submissive behaviour, whereas the second
type looked for more active and direct evidence of the trait. We
seem then to have here a tool of considerable value for analyzing
rating abilities.
(ii) Judg-ability

123. As early as 1908 it was found that some ratees were more
“judg-able,” i.e. more consistently rated than others. Allport
(1937) suggests that those about whom the judges agree most
closely are more “open,” less “enigmatic” in personality; and he
finds a definite correlation between measures of “open-ness,” and
ratings on traits such as extraversion and expansiveness. Some
caution is however needed in interpreting this result, for judg-ability
is likely to be fully as unreliable statistically as is judging ability.
Moreover, it may be that those about whom the raters disagree are,
more simply, the ratees who have no very outstanding traits, who
are “mediocre” rather than “enigmatic.” For it has frequently
been noted that extreme ratings are more consistent than intermediate
ones (cf. Cady (1923); Hollingworth (1929), etc.); also that those
judgments about which the rater feels most sure (and these are
usually very high or very low) are superior to judgments given
without assurance. All these researches, however, need to be
repeated with scales in which the units have been made equivalent
by psychophysical techniques (cf. § 87). Lack of assurance about
ratings near the middle of the scale may be merely a function of the
units, which are not generally based on discriminable differences.


(iii) Empirically Standardized Scales 

124. The following technique assumes great importance in some
of the tests to be described below, though it has as yet been applied
to ratings in only one instance. Its main feature is the introduction 
of some external criterion for determining what the scale or test 
measures, instead of relying merely on the apparent meaning of 
the scale. Clearly the application of such an empirical check on 
the scale’s validity is highly desirable ; unfortunately, there is 
often a tendency, as we shall see later, to eschew all psychological 
considerations of the significance of the scale, and to trust wholly 
to the empirical criterion to give results. 

125. Olson’s scale.—Olson (1930) wished to devize a rating scale
for personality maladjustment, or behaviour disorders, to be used
by teachers; but was anxious also to avoid giving them a scale
which dealt directly with such disorders, since it might be very
liable to distortion by halo and other prejudices. Instead, he
constructed an ordinary scale of ratings on 35 fairly innocuous and
common traits. This is published as the Haggerty-Olson-Wickman
Behavior Rating Schedule B (1930). Its empirical standardization
was carried out by applying it first to a special group of children
(the “standardization group”), who had already been rated very
thoroughly on a special analytic scale for personality maladjustment
by raters who could be trusted. Taking next all those children
who had received some particular rating on the general scale, he
calculated their average score for personality maladjustment from
the special scale, and so developed a maladjustment index for that
rating. Similar indices were determined for each of the other
possible ratings in the general scale.

Now when a teacher or group of teachers rates any child on the
general scale, the child's personality maladjustment may be found
from the sum of the indices pertaining to the various ratings he
receives. As Olson points out, the same technique could be employed
for estimating children's probable scholastic achievement; the same
general scale would be used, but each rating on it would be assigned
a scholastic achievement index on the basis of its previous application
to children of known high or low achievement.
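
The two steps just described, deriving an index for every possible rating and then summing the indices a child receives, can be sketched as follows. This is an illustration of the procedure as summarized here, not a reproduction of Olson's computations: the trait names, rating steps, and criterion scores are all invented, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_indices(standardization_group):
    """For each (trait, rating step) on the general scale, average the
    criterion scores of the children who received that rating.

    standardization_group: list of (ratings, criterion_score) pairs, where
    ratings maps trait name -> rating step on the general scale."""
    pooled = defaultdict(list)
    for ratings, criterion in standardization_group:
        for trait, step in ratings.items():
            pooled[(trait, step)].append(criterion)
    return {key: mean(scores) for key, scores in pooled.items()}


def index_score(ratings, indices):
    """Sum of the indices attached to the ratings a child receives.
    A rating never seen in the standardization group raises KeyError."""
    return sum(indices[(trait, step)] for trait, step in ratings.items())


if __name__ == "__main__":
    # Invented standardization data: two traits, criterion = score on the
    # special maladjustment scale.
    group = [
        ({"energy": 1, "attention": 1}, 10.0),
        ({"energy": 1, "attention": 2}, 6.0),
        ({"energy": 3, "attention": 2}, 2.0),
        ({"energy": 3, "attention": 1}, 4.0),
    ]
    indices = build_indices(group)
    print(index_score({"energy": 1, "attention": 2}, indices))
```

Substituting a different criterion (for instance scholastic achievement) changes only the scores fed to `build_indices`; the general scale and the scoring step stay the same.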

126. Discussion of the empirical technique.—The technique may
seem unnecessarily roundabout, or even perverse, since it takes no
account whatever of the conventional meanings of the ratings on
the general scale. The significance of such ratings as measures of
personality maladjustment or scholastic achievement is determined
wholly empirically. We will not try to assess its value until we have
seen what results it gives with other types of test; but would point
out here two important requirements if it is to be made to work.
First, the initial measure of maladjustment or achievement must be
as accurate as possible. In Olson’s investigation the initial measure
was derived from ratings on the special scale, and is therefore of
uncertain validity. Secondly, the standardization group must be
very large, for otherwise it is found that the standardized test yields
hopelessly inconsistent results. The consistency of Olson’s final
test was fair; for when tried out on a new group it gave a correlation
with ratings on the special scale of + 0.62.

It is worth incidental mention that the extreme ratings on
Olson’s general scale almost always obtained the highest
maladjustment indices. For instance, on the trait Physical Output of Energy,
those ratees who are checked as “extremely sluggish” score 5 for
maladjustment; those called “over-active, hyperkinetic” score 4;
the intermediate steps, “slow in action,” “moves with required speed,”
and “energetic-vivacious” receive indices of 3, 2 and 1. Child
Guidance experts would probably agree that most extreme traits
may be indicative of poor adjustment. In this instance then the
empirical method has given quite meaningful results.


V.—SELF-RATINGS AND PERSONALITY QUESTIONNAIRE
TESTS


A. DESCRIPTION OF TESTS

(i) Introduction
127. Self-rating tests, personality “inventories” or “schedules,”
or “pencil and paper tests” (as they are sometimes called) follow
precisely the same lines as individual attitude tests and analytic
rating scales. Their object is to obtain in quantitative terms an
individual's estimates of his own character and temperamental
traits. In a number of experiments, individuals have been instructed
to rate themselves on the same scales with which they rate others.
It is commonly found that, while an individual's associates may
differ considerably among themselves in rating him, his judgments
of himself tend to be still more divergent from the group judgments;
and that he is especially apt to over-rate himself on desirable traits,
i.e. to possess a favourable halo towards his own character. Since
then self-ratings, either on a single trait, or on a series of traits, are
of such dubious significance, an analytic scale technique is much more
frequently adopted.

128. These analytic scales include a large number of items
(anywhere from 10 to 223) bearing on the general trait, and they
are therefore much more likely than single self-ratings to yield
reliable measures. Moreover, the items cover a wide range of
presumed manifestations of the trait; they can deal primarily with
questions of the testee's behaviour rather than with his
self-evaluation, and they may be to some extent disguised (in much the same way
as in the Study of Values or Watson's Fairmindedness Test (§s 30-31)).
Hence their validity should also be improved.

It is probable that a hundred or more of such tests have been
published. But the great majority are simply modifications or
extensions of three prototypes: Woodworth's Personal Data Sheet,
Freyd-Heidbreder's Introversion-Extraversion Test, and Allport's
Ascendance-Submission Test. We will, therefore, outline the origin
and construction of these three, and mention a few others which
have been widely used, or which embody special points of technique.

The form of test items or questions is the same as in attitude tests
or graphic ratings. All of them have been standardized by one or
other of the three methods we have already described: the internal
consistency technique (§s 35-43), empirical techniques (§s 124-126),
or the external judgments and Thurstone-scaling technique (§s [...]).
Almost all of them are intended to be used as group tests.


(ii) Tests of Emotional Instability or Psychoneurotic Tendency

129. Woodworth’s Personal Data Sheet or Psychoneurotic
Inventory.—The 116 items in Woodworth’s test were originally derived
from descriptions by psychopathologists of the symptoms of neurotic
patients (e.g. from J. T. MacCurdy’s War Neuroses). The following
are some representative examples:—

Do you usually fee[...]
Do you ever walk in your sleep?
Have you ever h[...]
Did you have a happy childhood?
Do you know of anybody who is trying to do you harm?
Does it make you uneasy to cross a bridge over a river?
Have you ever been afraid of going insane?
Has any of your family had a drug habit?

Each question is followed by “Yes No”, one of which is to be checked.

130. Mathews’s, Laird’s and Thurstone’s tests.—Mathews (1923),
Cady (1923) and others have adapted the test for use with children.
Laird’s (1925) Personal Inventory B2 contains a similar series of
items, but with multiple choice (graphic) responses, e.g.:

Have you (in the [...] few months) been afraid of responsibility?

avoided it    accepted when     did not    liked it    welcomed it
              forced upon me    mind it

Thurstone’s Personality Schedule (cf. Thurstone and Thurstone
(1930)), which is now the most widely used, contains 223 items
collected from Woodworth, Laird, and other sources.

131. Standardization and Scoring.—In these, and the many other
derived tests, it has usually been shown that the items do hang
together consistently, by the internal consistency technique. Or
items may be selected from a more extensive preliminary draft on
the basis of their differentiation between testees who obtain extreme
scores on this draft. Thus Willoughby’s (1932b) Clark-Thurstone
Schedule includes the 25 best differentiating items from Thurstone’s
Schedule; graded responses (0 to 4) are provided.

The testee’s score usually consists of the total number of items
(unweighted) to which he responds in the psychoneurotic direction;
it is a “width” rather than an “altitude” type of score (cf. § 48).
Percentile norms are generally provided for showing whether his
score should be deemed high, low, average, etc.
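
A "width" score of this kind is simply a count, and a percentile norm is a position in a reference distribution. The sketch below shows both under stated assumptions: the item labels, the keyed directions, and the norm group are invented for illustration, and the convention of counting ties as half is one common choice rather than anything specified in the tests described.

```python
def width_score(responses, keyed_direction):
    """Unweighted count of items answered in the keyed direction.

    responses: item label -> answer given ("Yes" or "No")
    keyed_direction: item label -> answer counted as psychoneurotic"""
    return sum(
        1
        for item, answer in responses.items()
        if keyed_direction.get(item) == answer
    )


def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`,
    counting tied scores as half (an assumed convention)."""
    if not norm_scores:
        raise ValueError("norm group is empty")
    below = sum(s < score for s in norm_scores)
    tied = sum(s == score for s in norm_scores)
    return 100.0 * (below + 0.5 * tied) / len(norm_scores)


if __name__ == "__main__":
    # Hypothetical keying of three items.
    key = {"walks_in_sleep": "Yes", "happy_childhood": "No",
           "afraid_of_insanity": "Yes"}
    answers = {"walks_in_sleep": "Yes", "happy_childhood": "Yes",
               "afraid_of_insanity": "Yes"}
    score = width_score(answers, key)
    # Invented norm group of eight scores.
    print(score, percentile_rank(score, [0, 1, 1, 2, 3, 5, 8, 13]))
```

Because every keyed answer counts equally, two testees with the same total may have endorsed entirely different items.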


132. Burt's Questionnaire on Neurotic Symptoms.—Burt (1937c)
has published an English adaptation of the Woodworth and Thurstone
tests, but does not intend it to be scored quantitatively. He employs
it simply as one device for eliciting qualitative information about
personality problems in an interview.


(iii) Tests of Introversion-Extraversion

133. Freyd-Heidbreder’s test.—Freyd (1924a) collected 54 items
descriptive of the introvert type from Jung’s writings, of which the
following are samples:—

Blushes frequently; is self-conscious.
Daydreams.
Prefers to read a thing rather than experience it.
Shrinks when facing a crisis.
Is reticent and retiring; does not talk spontaneously.
Is slow in movement.
Keeps in the background on social occasions.

Heidbreder (1926) turned them into a self-rating test; the testee
is instructed to check each item +, ? or −, according as it applies
to him or not. Laird's (1925) Personal Inventory C2, and a host of
other adaptations are available.

134. Tests standardized on psychotic patients.—Some psychologists,
distrusting the internal consistency technique, have attempted
to apply an external criterion [...],
i.e. to adopt an empirical [...] [Neymann and Kohlstedt]
(1929) worked on the rather doubtful assumption that schizophrenic
or dementia praecox patients represent the extreme of
introversion and that manic-depressive patients are extreme
extraverts. Their test contains the fifty items which were found best to
differentiate between two such groups of patients. A shortened
form, prepared by Root, was used in Wyatt's (1937) research, and
is reproduced in full in Report No. 77. The Neymann-Kohlstedt,
and other tests which have been similarly standardized, tend to give
very poor correlations with introversion-extraversion tests that have
been standardized on normal persons by the internal consistency
technique; conversely, the commoner form of test differentiates
poorly between such mental patients.


(iv) Tests of Ascendance-Submission and Other Personality 
Traits 

135. The Allport A-S Test.—In this test (cf. Allport (1928)),
the items were devized as concrete manifestations of dominatingness
(ascendance) or submissiveness, e.g.:—

A salesman takes manifest trouble to show
you a quantity of merchandise; you are not
[...] suited; do you find it difficult to say
“No”? [...]

If you hold an opinion the reverse of that          In class ..........
which a lecturer has expressed in class, do you     After class .......
usually volunteer your opinion?                     Not at all ........

An alternative form is available for women, and an adaptation for
children has been prepared. Empirical standardization was used.
Allport first obtained ratings of students by their associates on
ascendance-submission; he then applied a preliminary draft of the
test to those who were rated as most ascendant or most submissive,
and, on the basis of their responses, chose the best items and
calculated an appropriate weighted mark for each response. The marks
for the above items are − 1, 0, + 1, and + 3, − 1, − 3, respectively.

136. Tests of other traits.—A test for Inferiority Feelings,
based on Adler's writings, has been compiled by Heidbreder (1927),
along the same lines as her Introversion-Extraversion Test.
Bernreuter (1933c) has published a test of Self-sufficiency v. Dependence
on Others, Jasper (1930) a test of Depression-Elation, which are
similar in form to those already described. Cason (1930) gives a
list of 217 Annoyances, i.e. situations which tend to annoy people.
If a testee rates each of these from + 3 (extremely annoying) to 0
(not annoying), his average rating can be used as a test of Irritability
or Annoyability.

(v) Tests Based on External Judgments 

137. In Wang's (1932a) test of Persistence, 111 items were
selected from a larger number in accordance with the views of 75
judges, who decided whether each item described a persistent or
non-persistent person. The Thurstone scaling technique was
adopted by Willoughby (1932a) in constructing his E-M (emotional
maturity) Scale. He collected 150 items descriptive of symptoms of
emotional maturity or immaturity, e.g.:—

S develops affective difficulty in the presence of a
necessity for precise or realistic thinking, e.g. mathematics

S organizes and orders his efforts in pursuing his
objectives, evidently regarding systematic method as a
means of achieving them

[...]

Everything in the world is against me (Scale value 0.9) to
Life could not be better for me (10.7).
The authors show that this gives good differentiation between manic
patients, normals and depressives.

Both these scales yield “altitude” scores, based on the average
scale value of the endorsed items.
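
An "altitude" score, in contrast to a count of endorsed items, averages the scale values of the statements a testee endorses. A minimal sketch follows; the two extreme statements and their values (0.9 and 10.7) are quoted from the text, while the middle statement, its value, and the function name are hypothetical.

```python
from statistics import mean

# Scale values: the two extremes are those quoted in the text; the
# middle statement and its value of 5.5 are invented for illustration.
SCALE_VALUES = {
    "Everything in the world is against me": 0.9,
    "Things are about average with me": 5.5,
    "Life could not be better for me": 10.7,
}


def altitude_score(endorsed, scale_values=SCALE_VALUES):
    """Average scale value of the statements the testee endorses."""
    if not endorsed:
        raise ValueError("no statements endorsed")
    return mean(scale_values[statement] for statement in endorsed)


if __name__ == "__main__":
    print(altitude_score([
        "Things are about average with me",
        "Life could not be better for me",
    ]))
```

The score therefore locates the testee on the scale rather than recording how many statements were checked.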


(vi) Multiple Tests 

138. Classification of psychoneurotic-test items.—Tests such as
Woodworth's or Thurstone's obviously contain a wide range of
symptoms drawn from many distinct neurotic or psychotic conditions.
It would be possible for several testees to give psychoneurotic
answers to entirely different sets of, say, 20 items. Although this
would indicate that they were completely different fro[...]
yet they would all get the same score and [...]
unstable or psychoneurotic. Surely it wo[...]

[...] correlating scores on these six sets, he [...]
the sets were really distinctive (cf. § 145).


139. Cattell’s tests.—Cattell (1936) has published a questionnaire
test with separate set[...] seven pathological syndromes—
Neurasthenia, Anxie[...], Anxiety Hysteria, Conversion
Hysteria, Obsessive-Compulsive, Epileptoid and Paranoid. Realizing
the difficulty (which we discuss below) of obtaining candid responses
to such intimate [...] questions, he intends it onl[...]
and highly co-operative testees, or el[...]
by a clinician rating a patient. In a [...]
Test, he eliminates all direct questioning of the testee, as shown
by the following item:—

John strained every nerve to beat the others because:
he was determined to be top ..........
his father wished him to succeed ..........
he needed the scholarship ..........

The testee is told to check the most appropriate of the three endings.
It is assumed that he will tend to project his own chief impulses
into the endings he chooses. Thus, if he is very self-assertive, he
would be likely to check the first in this example; if submissive he
would prefer the second. The test contains 74 such items which are
classified so as to yield scores on Self-Assertive v. Submissive
tendency, Cautious v. Bold, Acquisitive, Gregarious, Curiosity, and
Dependent or Appeal tendency. This classification is adopted from
MacDougall's list of instincts. The theoretical basis of the test is
ingenious, but it is unstandardized, and the practical results are so
far disappointing.

140. Boyd's Personality Questionnaire.—This test (unpublished)
is apparently the only one that has been widely used as a group test
in Britain. Its 120 items are classified under twenty headings or
general tendencies, including the following :—


    Tendency                       Sample Question

    Obsessional carefulness  ..    Do you often go over a job again and again
                                   to make it just right ?
    Worry, Anxiety  ..  ..  ..     Do you brood long over humiliating or
                                   unhappy experiences ?
    Suspiciousness  ..  ..  ..     Do you sometimes suspect that people are
                                   talking about you ?
    Self-consciousness             Are you greatly interested in what goes on
    (Introspectiveness)            in your own mind ?


The testee is, however, not told about these tendencies, and the ques- 
tions are so arranged that he is unlikely to guess that six of them 
deal with carefulness, six with worry, and so on. Questions may be 
answered Yes, Yes ?, 0, No?, or No, or omitted. These are marked 
4, 3, 2, 1, 0 and 2 respectively ; so that the scores on each tendency 
may range from 0 to 24. Whether the twenty tendencies are really 
distinctive and self-consistent will be considered below (§ 146). 
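
The marking scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is invented here, an omitted answer is represented by None, and the figure of six questions per tendency is inferred from 120 items under twenty headings.

```python
# Marks for each permitted answer, as given in the text. An omitted
# question (represented here by None, an assumption) is marked 2.
MARKS = {"Yes": 4, "Yes?": 3, "0": 2, "No?": 1, "No": 0, None: 2}

ITEMS_PER_TENDENCY = 6  # inferred: 120 items / 20 headings


def tendency_score(answers):
    """Sum the marks for the six answers belonging to one tendency."""
    if len(answers) != ITEMS_PER_TENDENCY:
        raise ValueError("expected six answers for one tendency")
    return sum(MARKS[a] for a in answers)
```

Six answers of Yes give the maximum of 24 and six of No the minimum of 0, which agrees with the range stated above; a testee who omits all six questions receives 12.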

141. Maller's Character Sketches.—Maller (1932) has adapted the
Guess-Who rating technique (cf. § 81) in a self-rating test for adolescents
and students. Two hundred short descriptions are given,
and the testee has to say whether or not he feels or acts like the
person described. This impersonal form is apparently much less
disturbing than direct questions. Examples are :—

This person insists on having his own way and likes to
command and rule everybody

This person finds it difficult to forget unpleasant memories
and can’t help thinking about them


Every item is repeated elsewhere in the test, but in reverse form,
e.g. :


This person never insists on having his own way and does
not like to command and rule everybody
By this means the piling up of descriptions of “unpleasant” people
is avoided (cf. § 158) ; also the carefulness of the testee can be
checked by noting whether he answers each reversed question
in the opposite way to the original. In the process of standardization
it was proved that all the questions differentiated significantly
between groups of 310 normal pupils and 308 problem cases,
delinquents, etc. The questions are classified under six headings :
    Desirable character traits
    Self-control and integration
    Social adjustment (extraversion)
    Personal adjustment (freedom from anxiety)
    Mental health (freedom from psychotic or neurotic symptoms)
    Readiness to confide in others.
The extent to which these different categories overlap is not stated ;
but their average intercorrelation is + 0·38 ± ·08.
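
The check on carefulness through reversed items lends itself to a short sketch. The Python below is an illustration only: the function names, the item numbering, and the use of True and False for "like me" and "not like me" are all assumptions, not part of Maller's published procedure.

```python
def inconsistent_pairs(responses, reversed_of):
    """Return the original items whose reversed twin got the same answer.

    responses   -- dict: item id -> True ("like me") or False ("not like me")
    reversed_of -- dict: original item id -> id of its reversed form
    """
    return [orig for orig, rev in reversed_of.items()
            if responses[orig] == responses[rev]]


def carelessness_rate(responses, reversed_of):
    """Proportion of item pairs answered inconsistently (0.0 to 1.0)."""
    return len(inconsistent_pairs(responses, reversed_of)) / len(reversed_of)
```

A careful testee should answer each reversed description in the opposite way to its original, so a rate near zero suggests attentive answering; what rate should be treated as too high is not stated in the text.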

142. The Humm-Wadsworth Temperament Scale.—This test
(1935) is perhaps the most logically worked out of any we have
described. It aims to measure the seven

[...] from psychiatric diagnoses, to be strong
or weak on these components. Marks for the items were calculated

[...] context (cf. footnote § 35).
[...] making the scale rather long ;
[...] is 55 minutes.

It is claimed that the scores obtained on these seven tendencies

[...] of No's ; highly suggestible persons give too many Yes's.

143. The Bernreuter Personality Inventory.—This test (1931,

[...] the items were not
logically selected in [...]

were in Allport’s, Neymann-Kohlstedt's, Maller’s and Humm-Wadsworth’s
tests). Instead, they were

[...] from various tests. Each of th[...]
an indication of all the fou[r ...]
of the 375 responses, four marks were calculated on the basis of
the agreement of that response with the four previous test results.
Thus the answers to the item, “Do you day-dream frequently ?”
are marked :—

                Neurotic      Intro-        Domin-        Self-
   Answer       Tendency      version       ance          Sufficiency
                score.        score.        score.        score.

   Yes          [illegible]   [illegible]   [illegible]   + 1
   No           [illegible]   [illegible]   [illegible]   − 1
   Doubtful     − 2           0             + 2           − 2


A testee's four scores are the sum of such + and − marks. It will
be seen that the construction and standardization of the test are
analogous to the procedure used in Olson's rating scale for personality
maladjustment (§§ 124-126). Nominally they are wholly
objective ; but in point of fact the criteria employed were far from
objective, since they consisted of four self-rating tests. Nevertheless,
the test works sufficiently well to have achieved tremendous
popularity ; it is said that some 50,000 copies of it are sold annually.
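
The principle of scoring one response on four scales at once may be sketched as follows. This Python fragment is a hedged illustration, not Bernreuter's key: only the marks for the answer "Doubtful" to the day-dream item are taken from the text, and every other weight, together with the item labels and the function name, is a hypothetical placeholder.

```python
SCALES = ("neurotic", "introversion", "dominance", "self_sufficiency")

# Four marks per response, one for each scale.
WEIGHTS = {
    ("day_dream", "Doubtful"): (-2, 0, +2, -2),  # from the text
    ("day_dream", "Yes"): (+3, +2, -1, +1),      # hypothetical placeholder
    ("day_dream", "No"): (-3, -2, +1, -1),       # hypothetical placeholder
    ("item_2", "Yes"): (+1, 0, -2, 0),           # hypothetical placeholder
    ("item_2", "No"): (-1, 0, +2, 0),            # hypothetical placeholder
}


def score(answers):
    """Sum the + and - marks on each of the four scales.

    answers -- dict: item label -> answer given ("Yes", "No" or "Doubtful")
    """
    totals = [0, 0, 0, 0]
    for item, answer in answers.items():
        for i, mark in enumerate(WEIGHTS[(item, answer)]):
            totals[i] += mark
    return dict(zip(SCALES, totals))
```

Because every response contributes to all four totals, the four scores cannot vary independently of one another, which is one reason to expect the scales to overlap.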


B. MULTIPLE FACTOR ANALYSIS OF SELF-RATING TESTS 


144. Overlapping of traits tested by personality questionnaires.—
The attempt to classify test items or symptoms logically into distinct
groups has not, we must admit, been successful. On the one hand,
it is found that tests of presumably different traits intercorrelate
very highly ; on the other hand, different tests of nominally the same
trait, although fairly reliable in themselves (cf. § 168), tend to give
very poor correlations with one another. It is doubtful then whether
most of the traits at which the tests have been directed are unitary
and discrete. Obviously, the psychoneurotic questionnaires include
a hotch-potch of symptoms ; and introversion-extraversion tests
appear to cover almost as diverse a collection, e.g. lack of social
interests, inhibition of emotional expression, etc. The two conceptions
indeed overlap very largely. The present writer has collected
from the literature the results of 40 experiments, which show that
the average correlation between different introversion tests, and the
average correlation between introversion and psychoneurotic tendency
tests, are practically identical, namely + 0·36 ± ·10. A further
18 experiments with the A-S test show an average correlation of
- 0·30 between submissiveness and introversion or psychoneurotic
tendency. Tests of inferiority feelings also agree quite closely with
tests of introversion. When these traits are combined in a multiple test
such as Bernreuter's the overlap is much higher. The average of four
experiments where the Bernreuter scores were intercorrelated is* :—

    Neurotic tendency with Introversion     + 0·93
    Neurotic tendency with Dominance        − 0·81
    Introversion with Dominance             − 0·67


* The Self-Sufficiency scores are relatively distinct ; they correlate − 0·41,
− 0·32 and + 0·58 with the other three. But the main reason for this is
obvious when one examines Bernreuter's Self-Sufficiency list, namely, that
it is itself an amalgam of Ascendance + Introversion items. Further comment
on these extremely high Bernreuter correlations appears below (§ 170).
[...]d Personality Questionnaire, the average intercorrelation
[...] nineteen [...] obtained in the writer's experiments,
was 0·366 ± [...]

145. Results obtained by multiple factor analysis.—Multiple
factor analysis has therefore been applied in an attempt to determine
objectively which types of items do hang together consistently,
and which are independent. In two researches tests have been
evolved to measure the factors that have emerged.
In Willoughby’s (1932b) study of the Thurstone Schedule,
nearly 50 per cent. of the correlations between his six sets of
items were accounted for by one factor (which incidentally satisfied
the Spearman tetrad-difference criterion). Perry (1934) applied
the Bernreuter, Laird B2 and C2, A-S and other tests to a group of
students and obtained first a large factor which was chiefly made
up of Bernreuter Neurotic and Introversion scores and the Laird
scores. A second factor seemed to represent a separate Dominance
tendency, being highly weighted with A-S and with Bernreuter
Dominance and Self-Sufficiency.
Flanagan (1935 [...]

[...]y were in effect measuring only two
distinct tendencies, not four. The first, a compound of Neurotic,
[...] Self-Sufficiency scores, seemed to [...]
This accounted
for 78 per cent. of the variance. The second, which covered 18 per
cent., he identified as a “Sociability” factor ; the third and fourth
[...] since they were responsible for only
4 per cent. of the variance. Flanagan has constructed fresh
[...] response to the Bernreuter items, so that the

test can be scored for these two factors.
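
To show what it means for one factor to account for a large share of the variance, the Bernreuter intercorrelations quoted in § 144 can be put through a simple principal-component calculation. The Python below is a sketch under stated assumptions: it uses plain power iteration, which is not necessarily the method of any author cited here; the pairing of the three Self-Sufficiency correlations with the other scales is assumed from the order in which they are quoted; and the figures are averages from other experiments, so the result need not reproduce the percentages reported above.

```python
import math

# Order: Neurotic, Introversion, Dominance, Self-Sufficiency.
# The last row and column rest on an assumed pairing of the quoted figures.
R = [
    [1.00, 0.93, -0.81, -0.41],
    [0.93, 1.00, -0.67, -0.32],
    [-0.81, -0.67, 1.00, 0.58],
    [-0.41, -0.32, 0.58, 1.00],
]


def first_component(matrix, iterations=200):
    """Largest eigenvalue and its unit eigenvector, by power iteration."""
    n = len(matrix)
    v = [1.0] + [0.0] * (n - 1)
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    rv = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * rv[i] for i in range(n))
    return eigenvalue, v


eigenvalue, direction = first_component(R)
# For a correlation matrix the total variance equals the number of scales.
share = eigenvalue / len(R)
print(f"first component: {100 * share:.0f} per cent. of the variance")
```

The signs in the returned direction show which scales rise and fall together on the first component; with these figures the Neurotic and Dominance scores fall at opposite ends.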
146. [...] of the Boyd Questionnaire.—The [...]
50 men and 50 women students on [...]


[...] of the scores), and three further factors accounted
together for another 35 per cent.


[...] factor residual correlations was only [...]
greater than three times its probabl[e error ...]
[...]tion was not worth while. It follo[ws ...]
very far from distinct ; each one, moreover, gives appreciable
correlations with at least two of the factors.

After three rotations of axes it appeared possible to identify the
factors tentatively. The first was prominent in almost all the
measures, but especially in the following :—

Depression or Melancholy ; Instability or Temperamentalness ; Worry and
Anxiety ; Lack of Self-Control ; Shrinking of Responsibility ; Lack of
Self-Sufficiency or Confidence.


† The twentieth measure was too unreliable to be of any use. This
figure is corrected for attenuation.

These all gave loadings of + 0·73 to + 0·57, and suggest a general
self-depreciatory or psychoneurotic tendency.

The second factor, independent of the first, might be called a
tendency to “Care-freeness,” its highest loadings (0·48 to 0·30)
being with :—

Shrinking Responsibility; Suggestibility ; lack of or freedom from 
Worries, from Self-Consciousness and Emotional Thinking ; Dissociation or 
Unintegrated Thinking ; Inability to Concentrate ; Lack of Definite Interests ; 
and freedom from Tenseness. 

The third factor, again independent, gave the clearest picture, 
which might be named “Scrupulousness.” With twelve measures
the loadings were less than ± 0·10 ; but with the following seven
measures they were 0·60 to 0·24 :—

Obsessional Carefulness ; freedom from Instability-Temperamentalness ;
Acting Readily without Pressure ; freedom from Emotional Thinking and
from Inability to Concentrate ; Suspiciousness ; and strong Self-Control of
Feelings.

The fourth factor consisted chiefly of those measures in which 
the women scored higher than the men, or vice versa. It is not easy 
to interpret, but certainly represents sex differences in this test.
The highest loadings (0·56 to 0·22) were, for women :—

Strong Dislikes ; Strong Fears ; Instability ; Lack of Self-Sufficiency
(i.e. Dependency) ;
and for men :—

Persecutory (or expecting consideration from others) ; Suspiciousness ;
Inability to Concentrate ; and Self-Consciousness or Introspectiveness.

In judging the significance of these results it should be 
remembered that the testees were student teachers passing through 
the strains of adjustment to their career; also that the test certainly 
does not presume to cover every side of personality, for instance
it omits almost all social and moral qualities, interests and values. 
The factors naturally depend on the types of questions used, and on 
the people who answer them. Within these limits they do appear to 
represent rather general and meaningful tendencies. 

147. Analysis of introversion-extraversion.—Perhaps the most
illuminating research in this field is Guilford's analysis of
introversion-extraversion (Guilford and Guilford (1934, 1936)). Taking 36 typical
items from introversion tests, he calculated the inter-item correlations
and found a fairly prominent general factor, which was chiefly
loaded with items indicating sociability or the reverse. Later,
he applied Thurstone’s technique, and after rotation of axes, 
obtained five distinct factors, which could be fairly readily identified 
as :— 

    I. Social-asocial tendency
   II. Emotional immaturity or dependency
  III. Masculinity or Aggressiveness
   IV. Care-freeness
    V. Intellectual interests

(The common conception of extraversion would then be made up of
I, III, IV and of the reverse of II and V). Guilford has now published
the Nebraska Personality Inventory of 100 items, which can be scored
to measure the first three of these factors.

148. Layman’s factorial study.—In a detailed research, the full 
results of which are not yet published, Layman (1937) obtained 
twelve independent factors from the correlations between e7 
personality test items. These items were selected from several
tests, and were applied to 276 students. After rotation of axes the 
factors were identified as follows :— 

    I.  Sociability factor   (i) Gregariousness
   II.       „        „      (ii) Feeling of Social Inadequacy
  III.       „        „      (iii) Social initiative
   IV.       „        „      (iv) Social aggressiveness
    V.  Changeability of interests
   VI.  Independence or self-sufficiency
  VII.  Feeling of inferiority or lack of self-confidence
 VIII.  Impulsiveness
   IX.  Emotionality factor  (i) Moodiness
    X.       „        „      (ii) Sensitivity or excitability
   XI.       „        „      (iii) Emotional introversion
  XII.  Inability to face reality

149. Conclusion.—If we compare these studies we certainly do
not find that they yield identical factors ; this could hardly be
expected in view of the diversity of the test material with which
they started. The majority however agree in showing the predominating
importance of one factor, whatever the test or the items
used*, which may be interpreted as lack of self-confidence or instability
or maladjustment of personality. Several also show an
independent sociability factor. When both men and women are
tested, there is a sex difference factor. And Guilford and the writer
seem to concur rather closely on a “care-freeness” factor†.
Layman's study suggests that these rather general tendencies may
be split up into several different components when a more detailed
analysis is made. It is possible that her Factors I and II correspond
to the sociability, III and IV to the aggressive-masculine, V and VIII
to the care-free, and the remainder to the general adjustment factors.
Thus, although other, somewhat different, patterns of factors are
likely to emerge from further researches, yet the results so far
obtained do seem to have effected a considerable clarification of
the field. We see also that self-rating tests are not likely ever to


* One further stud 
analysis. Rundquist and Sletto (1936) c 
timis 


adjustment. Apparently then, th 


personality questionnaires. 
be able to cover all the main variables in personality, since items or
sets of items which attempt to measure a number of different traits 
actually indicate only a limited number of distinct dimensions. 
We shall return later (§ 171) to the discussion of the psychological 
significance of such dimensions. 

150. Factor studies of annoyances.—Two factorizations of the 
Cason Annoyances Test are somewhat more discordant. Carter, 
Conrad and Jones (1935) found a large factor of “general annoyability,”
and minor factors relating to annoyances caused by untidiness,
by characteristics of people, etc. Harsh (1936) however
arrived at five main dimensions, namely annoyances due to :— 

I. Appearance of others 

II. Violating of morals or mores 

III. Suggestions of superiority in others 

IV. Unintentionally disagreeable acts 

V. Personal sensitivity. 
The lack of agreement may merely reflect the different types of 
testees employed, and the somewhat different preliminary classifi- 
cations of the test items. 

151. Factorial studies of correlations between persons.—Harsh
also applied the “Q-technique” to discover persons possessing
different types of annoyability. His full results have, however,
not yet been published.

Stephenson has carried out two similar studies with self-ratings.
In one of these (1936c) 21 persons rated themselves on 22 traits
connected with psychic tempo (Smart, Fluid, Tenacious, Pedantic,
etc.). Two independent types of persons sufficed to account for the
inter-correlations. The first group seemed to correspond to the
normal type, regarding themselves as “lively, fluid, smart,
tenacious,” etc., and rating themselves low on “inhibited, moody,
distraught.” The second group was less easy to interpret ; they
rated themselves high on “inhibited, comfortable flow of energy, and
distraught,” low on “fluid, bustling, fanatical, flighty and pushing.”

152. In the other research (1936b), ratings were obtained from
normals, from manic-depressive and from schizophrenic patients on
the relative prominence in themselves of thirty moods (“cheerful,
sensitive, nervous, affectionate, serious,” etc.). Distinctly different
patterns of moods were found in the abnormal groups, though
the members of each group correlated closely among themselves.
The normals were saturated with these two types to varying extents.
Stephenson suggests that a test of cycloid or schizoid tendency
could be constructed by correlating such self-ratings with the 
characteristic patterns of the patients. A test along these lines 
would indeed possess several advantages over the ordinary self-rating 
tests described above; but the proposal has not yet been followed up. 
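
The kind of test Stephenson proposes, in which a person's self-ratings are correlated with the characteristic pattern of each patient group, might be sketched as follows. The Python below is an illustration only: the function names are invented, and the short five-mood patterns used in the usage note are made-up figures standing in for the thirty moods of the original study.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def type_saturations(self_ratings, patterns):
    """Correlate one person's mood ratings with each group pattern.

    self_ratings -- the person's ratings of the moods, in a fixed order
    patterns     -- dict: group name -> that group's characteristic ratings
    """
    return {name: pearson(self_ratings, p) for name, p in patterns.items()}
```

For example, with the hypothetical patterns {"cycloid": [5, 4, 1, 2, 3], "schizoid": [1, 2, 5, 4, 3]}, a person whose own ratings are [5, 4, 1, 2, 3] correlates positively with the first pattern and negatively with the second. The score here is a correlation across moods within one person, not a count of symptoms checked.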


C. RELIABILITY AND VALIDITY OF SELF-RATING TESTS

153. Objections to personality self-rating tests.—Many persons 
on first seeing some of these tests tend to react either with ridicule 
or disgust. To them it seems obvious that candid answers to such 
intimate questions will never be given, so that the results will be
entirely worthless. In most countries also psychologists manifest
a similar distrust. As mentioned above, Burt (§ 182) uses his test
only as an introduction to a personal interview. In much the same
way the vocational guidance examiners at the National Institute of
Industrial Psychology obtain from the candidate self-ratings on a
list of traits, which are then discussed in the interview (cf. Rodger,
1934). No quantitative results with these, nor with Cattell’s or
Boyd's personality questionnaires have been published so far.
Very occasional reports have appeared of the translation and
application of Woodworth’s or Thurstone’s or other inventories in
France, Germany, Spain, Poland, Australia and China.

154. Uses of self-rating tests.—The contrast in America is very 
striking. Hundreds of investigations have been carried out with 
such tests, mainly on University students, but also on children, 
mental hospital patients, delinquents, and other groups. Some 
Child Guidance and University Clinics employ them regularly, to 
aid in the diagnosis of maladjustment. In many hundreds of schools
in the Middle-West they are applied to all the children of a certain
age in the hope of picking out problem cases, with what success
we do not yet know. Often also they are used as criteria for assessing
the validity of other alleged personality tests, e.g. for investigating
the significance of handwriting, or of endocrine records, as measures of
personality traits. But the commonest type of study consists of
little more than the tabulation of the scores, or of the responses 
to particular items, given by special groups such as academically 
successful and unsuccessful students ; criminals ; identical and non- 
identical twins ; spinsters ; married couples and divorcees. Whether 
such studies have revealed anything new or important, or whether
the tests can validly be applied in clinical practice, is rather doubtful.
Indeed it would seem as though American psychologists, flushed
with the success of their methods of measuring intelligence and
aptitudes, have incautiously assumed that emotional traits could
be measured in the same way. A count of the number of intellectual
tasks which a testee can accomplish does indeed provide an index 
of his intelligence ; but a count of the number of symptoms which 
he checks in the Woodworth or other inventories is not necessarily 
an adequate measure of his instability or psychoneurotic tendency. 

155. Justification of the tests.—Nevertheless we cannot simply
[...]tigations as utterly worthless.

[...] an extremely valuable source of
[...]ce which is not tapped
either by more objective tests or by associates' ratings. In the [...]


156. Importance of testee’s attitude.—The outstanding feature
of the tests is that so many
[...] emotional tone and personal significance, matters which few
of us are ready to reveal to the public gaze. We might discuss
them with a sympathetic and trusted friend or psychoanalyst, but
would naturally hesitate to commit them to writing which some
relatively unknown experimenter will read. Thus all the inhibitions,
[...] and difficulties mentioned above (§§ 21-25, 97-98) are
likely to be greatly intensified. Until recently most American
psychologists paid little attention to these subjective factors,
believing that so long as a standard objective situation was enforced
among all the testees, results would be obtained whose validity could
be checked by purely empirical methods. But the paucity of such
results seems now to have converted many to a realization of the
importance of the testees’ attitudes to the tests.

157. Conditions affecting testees’ attitudes.—Thus Maller (1932)
admits that 70 per cent. of students were irritated by the usual type of
inventory, and so substituted the impersonal form of question
in his Character Sketches (§ 141), which only irritated 43 per cent.
In several experiments he found that a careful preliminary talk
on the value of the test and on the desirability of frankness increased
the numbers of maladjusted symptoms which the testees admitted ;
that when they were told that the test would be used to determine
vocational fitness, or when they had to sign their names instead of
answering anonymously, the maladjustment scores decreased.
Olson (1936) and others have also obtained different results from 
signed and unsigned tests, though the statistical significance of the 
differences was dubious. Layman (1937) asked her testees which 
items they would answer differently if they had to sign their names ; 
: such that a frank answer “ might tend to belittle 
the individual in the eyes of the social group.” 

158. Socially acceptable and unacceptable items.—Uehling (1934)
criticizes the internal consistency technique of standardization
[...] of the most “unpleasant” items, and
[...] He advocates retaining
[...] predictive indices but
[...] which may calm down the testees. (Such non-diagnostic items are
often referred to as “jokers”). The comparison by Smith (1932) and
by Rundquist and Sletto (1936) of results obtained with socially
acceptable or positive items and unacceptable or negative items
(cf. § 88) is very illuminating. Smith included items such as the
following in a test of inferiority feelings :—

    Feels people speak well of him and like him
    Feels people criticize him and dislike him

He found that a considerably greater number would admit that the
former (positive) type did not apply to them, than the number who
would admit that the latter (negative) type did apply. The two
types inter-correlated very poorly. Rundquist and Sletto find the
negative type to be more highly consistent than the positive,
and consider that this is because they all arouse a suspicious, hostile
or evasive attitude ; hence, testees answer them all alike, regardless
of their real meaning. Positive items, on the other hand, are considered
more calmly, and so evoke a greater diversity of response
and lesser consistency. It is noticeable that the majority of
questions in the most commonly used tests of psychoneurotic or
introverted tendencies are of the negative type : though some,
such as the Neymann-Kohlstedt, E-M, A-S, and Character
Sketches, include both types in equal proportions.

Bernreuter (1933b) has shown, however, that the average testee
does not merely check the socially desirable responses in his Inventory,
nor answer simply in accordance with a self-ideal. For in an experiment
where instructions were given to respond in these specific
ways, the scores were quite different from the scores obtained with
the ordinary instructions.


159. Not all the questions, of course, are as intimate or as negative
as “Do you feel that life is a great burden?”, or, “Have your
relations with your mother always been pleasant?”. Some of the
Woodworth-Thurstone items, e.g. “Do you have a great many bad
headaches?” and many of the questions in Allport's A-S test deal
with comparatively unemotional matters, with facts of past history,
or with present physical characteristics. The difference between
these two extremes has been demonstrated in several investigations.
Neprash (1936), Lentz (1934) and Johnson (1934) analyzed the repeat
reliability of items, and found that the more objective ones were
most reliable, whereas items involving judgments of the testee’s
own mental states were most subject to change. Willoughby and
Morse (1936) applied a 40-item inventory in individual interviews,
and noted the spontaneous comments vouchsafed by the testees.
Items “concerned with superficial or conventional matters which
touch no affective springs” aroused little or no comment, whereas
items which “touch complexes of high affective content—sex,
guilt, fear of disapproval . . .” frequently aroused amused, resentful,
or embarrassed reactions. There is, of course, no complete
separation between these two types : affective reactions and factual
statements will intermingle in the same way that interpretation and
observation overlap in the rating of others (cf. §§ 91, 104). To
quote Willoughby again: “There is evidence that a substantial
minority . . . will misinterpret or rationalize their response to
almost any item, but particularly to those on which an unfavourable
or direct response would engender subjective pain.” These conclusions
are all the more noteworthy in that Willoughby has been
responsible for several of the most extensive mass-investigations with
personality questionnaires (1934–1936, 1937). Yet he now admits
their “susceptibility to complete vitiation by compensatory
mechanisms,” and the entire lack of control of the attitude of the
testees towards the test and the tester.

160. The influence of unconscious affective factors.—We do not
merely have to reckon with conscious resistances, hesitations, and
lack of candour among the testees. Whether or not we are favourable
towards psychoanalytic doctrines in general, we must admit that
psychoanalysts have demonstrated also the importance of unconscious
resistances. People literally do not know themselves well
enough to answer many of the questions correctly. Their responses 
are only too likely to be rationalizations or unwitting self-deceptions. 
In an illuminating discussion of personality questionnaires and 
psychoanalytic techniques, Alexander (1934) shows that the psycho- 
analyst employs methods for breaking down resistances and getting 
at the truth which are the very reverse of those used by the 
tester. Introspection and critical consideration are of little use to 
the former ; nor would he ever ask direct questions about personal 
sentiments and complexes until a suitable state of transference 
or rapport had been set up. Rather he relies on fantasy, dreams 
and free association, where conscious control is slackened. But in 
the application of these tests, conscious criticism is at a maximum. 
All that the tester can do is to ask for good co-operation, and promise 
that the results will be treated confidentially (cf. Vernon (1934b)). 
Again the tests only allow 2, 3, or at most 5 possible responses to 
each item, whereas the natural reactions of the testees will be 
infinitely varied. We have learnt also from psychoanalysis that 
words are a very inadequate medium in which to express our
emotional tendencies, presumably because these tendencies are much
older and more primitive, phylogenetically and ontogenetically,
than are our language capacities. How much more difficult must 
it be then to express them satisfactorily in terms of Yes, No, or 
numbers, and the like. 

161. Effects of testees’ co-operativeness.—We will attempt now 
to specify the main subjective factors which are likely to influence 
the testees’ responses, factors which are, however, irrelevant to the 
experimenter’s aims. First will come the conscious feelings of 
irritation, suspicion and resentment, which may lead in some cases 
to deliberate falsification. These will of course depend largely on 
the testees’ attitude to the experimenter, the extent to which 
they respect him, and their notions as to his object in applying 
the test. Obviously some testees will be much more conscientious 
and co-operative than others. The effects of these conscious 
attitudes are well illustrated by two investigations among college 
students.

Hanna (1934) applied the Thurstone Schedule to 179 students who came 
to the College Clinic for psychological or vocational guidance. Presumably 
they answered it as frankly as possible in the hope that it would help them. 
Their scores were later compared with independent estimates by clinical 
psychologists of their degree of maladjustment, and quite good agreement
was found. For instance, of those who checked 0-29 psychoneurotic answers
21 per cent. were classified as maladjusted ; among those who gave 30-74 the
proportion was 58 per cent.; and among those who scored 75 or over 79 per
cent. were maladjusted. Moran (1935) applied a similar test to 189 students,
41 of whom were classified on the basis of other evidence as maladjusted.
But here the test was taken along with tests of abilities, etc., at the beginning
of the College year, so that many of the testees may have thought that the
authorities would be influenced by the desirable or undesirable picture of
themselves that the test conveyed. The result was that there was no appreciable
difference between the responses of the well and maladjusted students ;
and subsequent investigation revealed many direct contradictions between
the responses and their actual symptoms.

162. Effects of testees’ self-analyticness.—Secondly, it would 
seem that some persons are far more introspective than others and 
more used to verbalizing their emotional experiences to themselves. 
We would not claim that they actually know themselves better. 
Indeed medical psychologists often state that such “self-analysts”
are more difficult to psychoanalyze than are more naive and unselfconscious
individuals ; and some confirmation for this view may
be derived from experiments on the good and bad self-rater (§ 178). 
While it is unsafe to generalize too far, it is probable that the former 
are, on the whole, more intelligent and better educated than the latter. 
If this is true it would explain the remarkable fact, which has 
emerged again and again from personality questionnaire studies, 
that University students and members of the professions score much 
higher in psychoneurotic tendencies than do the relatively uncultured. 
Frequently they are found to be as unstable as mental hospital 
patients, i.e. as neurotics and psychotics drawn mainly from lower
social classes. (There seems also to be a slight tendency for the better
students to be more neurotic and more introverted, though the
experimental evidence is far from unanimous on this point ; among
children the reverse relationship is more often found.) Now
students and professionals may in actual fact be more neurotic and more
introverted than other social groups ; but we would suggest that the
main explanation of this result is that they are more aware of their
emotional lives, and more willing to admit to themselves, and to the
experimenter, the possession of the symptoms which the tests
describe. Again, the unsophisticated persons who obtain lower
scores may be more stable ; but it is also probable that they do not
realize their emotional weaknesses. This emotional self-consciousness
factor may then bear but little relation to overt behaviour or
other signs of psychoneurosis, and yet influence considerably the
number of symptoms which are checked.

163. Effects of testees’ suggestibility.—Another factor to be 
taken into account is suggestion. Investigations in legal psychology 
and experiments on memory have shown how extremely liable to 
[...]sses in answering the tests, others [...] disposition may greatly
exaggerate [...] to what we termed [...] the form of the test (§ 25).
While there is as yet no direct evidence of the operation of [...], the
following three experiments appear to show its operation. 

164. Hollingworth’s investigation of shell-shocked patients. In 1918
Hollingworth applied the Woodworth test to groups of soldiers in a [...]
shortly before the declaration of the armistice, and to others
shortly afterwards. The average incidence of psychoneurotic symptoms 
was roughly half as great in the latter as in the former group, and
Hollingworth ascribes this to their different motivation. The former, being in great 
fear of eventual return to the firing line, consciously or unconsciously ascribed 
to themselves many symptoms which were rejected when this fear was 
removed. 

165. Watson’s investigation of strict upbringing. G. B. Watson (1934) 
applied a long questionnaire on emotional development to 210 students, 
and apparently obtained exceptionally good co-operation and frankness. 
On the basis of the responses to 17 items he separated those subjects who had 
had very strict parents from those brought up in liberal homes. From the 
responses of these groups to the other questions he found that the former 
believed themselves to have had poorer health, to be less well adjusted at 
school ; they disliked their teachers, had fewer friends, more broken
engagements ; were more prone to day-dreaming and nightmares, and so on. Now 
it may be that these results constitute a valuable proof of the claims that 
clinical psychologists have long been making as to the ill-effects of an
over-strict home. Other, more objective, evidence might indeed be brought 
forward to support Watson’s conclusions. And yet the results are so “neat”
that it is difficult to avoid the suspicion that the more neurotic students, 
who are generally sorry for themselves, tend both to rationalize away their 
maladjustment by presuming that they were badly treated in childhood, 
to remember more of the unpleasant, fewer of the pleasant, experiences of
family and school life, and to ascribe to themselves more present difficulties 
and problems than do the better adjusted. This is at least a possible 
alternative explanation, which fits in well with the teachings of medical 
psychologists. 

166. Block’s investigation of adolescent worries. Thirdly, we would 
mention a study of sources of conflict between adolescents and their parents 
by Block (1937). A list of fifty possible sources was checked anonymously 
by five hundred 12 to 17 year old pupils in an American city. The instructions 
were to check only those items which were “ seriously disturbing ” and which 
made them “ very unhappy.” Typical results were that 86 per cent. of boys 
and 71 per cent. of girls checked : “Won't let me use the car”; and that 75 per 
cent. of boys and 64 per cent. of girls checked “Pesters me about my table 
manners.” Now the relative incidence of these and the other sources of 
conflict in boys and girls of various ages would appear to be highly meaningful, 
but it is very difficult to accept the absolute figures. Surely it is likely that 
large numbers of the testees, seizing the opportunity to pour out their
grievances anonymously, accepted a great many items which were suggested by the 
test, whether or not these were “seriously disturbing.” In none of these 
three investigations need we imply that there was any conscious falsification. 


167. Effects of mood.—It might be suspected that the testee's 
temporary mood would have a considerable effect on his responses, 
that when depressed he would appear more neurotic, introverted, or 
submissive than when optimistic. An experiment by Johnson (1934) 
shows, however, that though this does occur, the influence is 
quite small, and hardly significant statistically. 

168. Reliability of self-rating tests.—In general the repeat 
reliability and the internal consistency (split-half reliability) of the 
tests are as high as they are among attitude tests. Coefficients of 
+ 0.85 or more are typical. Lower figures are of course obtained 
when the number of items is small. In Boyd’s Questionnaire the 
average coefficient for the sets of six items was + 0.583. Lentz 
(1934) and others find alterations in some 15-20 per cent. of items on 
retesting (some items being more liable to variation than others, 
cf. § 159), but many of the changes cancel one another out. 
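
The dependence of reliability on the number of items can be illustrated with the Spearman-Brown formula. This formula is not cited in the paragraph above and is brought in here only as an assumption about how such figures are usually related; the function names and the tiny data set are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r, k):
    """Predicted reliability when a test of reliability r is made k times longer."""
    return k * r / (1 + (k - 1) * r)


def split_half_reliability(item_scores):
    """item_scores holds one list of item scores per testee.

    Odd-numbered and even-numbered items are summed separately, the two
    half-scores are correlated, and the result is stepped up to full length.
    """
    odd = [sum(row[0::2]) for row in item_scores]
    even = [sum(row[1::2]) for row in item_scores]
    return spearman_brown(pearson_r(odd, even), 2)


# Illustration only: a six-item set with reliability 0.583, lengthened fourfold.
print(round(spearman_brown(0.583, 4), 3))
```

On the formula's assumptions, a set of six items with a coefficient of 0.583 would reach roughly 0.85 if lengthened to twenty-four comparable items, which is in keeping with the contrast drawn above between short and long tests.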

169. Significance of high reliability.—It seems probable that, [...]
from consistent [...] subjective pain, i.e. towards consonance with an
acceptable [...] ideal.” That the testee’s general self-estimate may
influence hi[s] [...]cated in some minor experiments [...] that testees
who knew the object [...]y high consistency. Usually the [...] attempts
to measure is withheld ; [...] ere obtained among [...] and Guilford
(1936), and [...] the inter-correlation of scores on separate 
measures was 0.24, but this was raised to 0.44 when the items were 
combined into a single test. 

[...] identical with one another, Stag[...]

171. Significance of results of factor analysis studies. We are 
now in a position to interpret the results obtained from multiple factor 
analyses of self-rating tests. It was shown above (§ 149) that most 
of the tests, whether they are directed at introversion, or
submissiveness, at tendency to fantasy, at parental or sex adjustments, at 
excitability, depression, worry, shrinking responsibility, or at 
neurotic tendencies in general, all actually measure one and the same 
general factor. Although additional factors were needed to account 
for the whole of the correlations between these diverse tendencies, 
they were usually of minor importance. A parallel state of affairs 
was found in Chapter IV, when ratings of others were subjected to 
factorization ; and it appeared there that the general factor consisted 
largely of halo. The writer would suggest then that the general 
factor in self-rating tests similarly consists of the subjective attitudes 
that we have been considering. Probably it is a highly complex 
tendency. Those testees who obtain big scores on psychoneurotic, 
introverted and other traits may be the most conscientious, the 
most willing to reveal their emotional weaknesses to the experimenter, 
or they may be the most self-analytic and sophisticated, the most 
conscious of their weaknesses ; or they may be the most suggestible 
and neurasthenic ; while those who obtain generally low scores may 
resent the experimenter's curiosity and consciously attempt to draw 
a favourable picture of themselves, or they may be less introspective, 
more “tough-minded.” This interpretation appears to fit all the 
experimental facts cited above, and to accord with the more
speculative generalizations about the subjective state of mind of the testees 
which we derived from psychoanalytic and other psychological 
considerations. 

172. Validity of self-rating tests.—Thus we are faced with much 
the same problem as in the previous chapter. It was clear there 
that the general factor did not consist merely of halo in the minds 
of the raters, but also represented in part general strength or weak- 
ness of character in the ratees. Similarly it is probable that the 
general factor in self-rating tests does in part correspond to a genuine 
maladjusted-psychoneurotic-introverted tendency, which is mani- 
fested both in overt behaviour and in the judgments of acquaintances. 
Indeed the two general factors might even turn out to be identical 
if we could remove halo from the one, and subjective attitude dis- 
tortions from the other. Again there might be close correspondence 
between our general factor and Burt’s general emotionality (§ 118), 
also between the commonly discovered sociality factor (§§ 145, 
147-149) and Burt’s sthenic-asthenic dimension. But while the 
separation of halo from ratings did appear to be feasible (§ 113), 
the purification of self-ratings is at present quite impracticable. 
Hence we are not entitled to claim that self-rating tests directly 
measure the traits by which they are called. Nor can we say 
whether the second and subsequent factors (obtained after the first 
has been partialled out) possess a greater validity, or are more free 
from “self-halo” effects, than the first. 

173. Empirical evidence of validity.—We would therefore expect 
investigations of the validity of self-rating tests to give on the whole 
positive, but variable and rather poor, results. The writer has 
collected from the literature 44 results of comparisons, by 22
investigators, of the tests with ratings of the testees on the same traits by 
associates ; the average correlation is + 0.40, the range − 0.15 to
+ 0.79. (Here of course both the raters’ and testees’ halos tend
to reduce the agreement.) In a detailed investigation of a small
group of college students, the writer constructed batteries of tests
of various types for measuring a number of personality traits, and
so was able to calculate the validity of the different tests. External
ratings yielded validity coefficients averaging + 0.60 ± .09, and
five of the self-rating tests described above gave an average
of + 0.45 ± .11. Though these figures are low when compared
with the validity of intelligence or educational tests, they are as
good as, or better than, similar correlations obtained with typical
objective tests of temperament and character. And if it is desired 
to develop batteries of tests which will measure traits as accurately 
as possible, then there can be no doubt that such batteries should 
include external and self-ratings. In spite of their defects they do, 
under good conditions, provide data of partial validity. 
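
A validity coefficient of the kind quoted above is a correlation between test scores and an outside criterion, here the ratings given by associates. The following is a minimal sketch of that calculation, assuming that the ratings of several associates are first averaged for each testee; the function names and figures are hypothetical and are not taken from any of the studies cited.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def validity_coefficient(test_scores, ratings_by_rater):
    """Correlate each testee's test score with his pooled rating.

    ratings_by_rater holds one list per rater, each in the same testee
    order as test_scores. Pooling by simple averaging is an assumption.
    """
    pooled = [mean(column) for column in zip(*ratings_by_rater)]
    return pearson_r(test_scores, pooled)


# Hypothetical data: five testees, two associates rating each on one trait.
scores = [12, 30, 18, 25, 9]
ratings = [[2, 5, 3, 3, 1], [3, 4, 2, 4, 2]]
print(round(validity_coefficient(scores, ratings), 2))
```

Pooling several raters before correlating is one common way of steadying the criterion; it does nothing to remove halo that the raters share.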

174. Other evidence may be derived from comparisons of the test 
scores of groups with known characteristics. As already mentioned 
(§ 161), well and badly adjusted students do show differences if the 
testing conditions are favourable. Thurstone and Thurstone (1930), 
Bernreuter (1933a), Stagner (1934) and others have also claimed that 
extreme scores are genuinely diagnostic of personality disorders ; 
but Landis (1932), Downey (1932), etc. record great discrepancies 
between the test results and other information about the testees’
personalities. Some investigators (cf. Mathews (1923)) have shown 
delinquent children to be somewhat more neurotic than normals. 
Murray (1932) however discovered no difference, but did obtain 
[...] differentiated significantly between groups of 15-18 year problem 
cases or delinquents and normals. But he does not describe 
the kind of co-operation he obtained from his testees. Simpson 
(1934a) claims a moderate correlation between the Thurstone 
[...] (number of incarcerations) among adult [...]


175. Results with psychotics and neurotics.—Hunt (1936)
summarizes a number of experiments with [...] points out how much
depends on the [...] obtaining candid res[ponses] [...]
depressives obtained scores lower by 8 per cent. on the average than 
the other groups ; but the normals and schizophrenics were almost 
identical. Nevertheless, a few of the separate test items did 
differentiate fairly reliably between the three groups. The authors 
conclude therefore that the pattern of responses given by different 
types of patient may be meaningful, although purely quantitative 
comparisons based on the number of responses are valueless. In 
another study by Landis, Zubin and Katz (1935), the Bernreuter 
scores gave no significant differentiation between groups of
schizophrenics, manic-depressives, organic psychotics, psychoneurotics 
and normals. But the Maller Character Sketches, a test which 
was in the first place standardized empirically on abnormal 
testees, did show somewhat poorer adjustment among the 
psychoneurotics. 

176. Other results indicating validity.—An interesting study by 
Boynton, Dugger and Turner (1934) demonstrated a greater incidence 
of symptoms in school children who were taught by teachers with 
high scores than in children under the charge of more stable teachers. 
Carter (1935) obtained distinctly greater resemblance between the 
Bernreuter scores of 55 pairs of identical twins than between those 
of 74 pairs of non-identicals ; like-sex fraternals were also more 
similar to each other than unlike-sex pairs. He interprets this as 
evidence for the partial determination of the test performances 
by hereditary emotional factors. Studies of family resemblance
usually give small positive correlations between husbands and wives, 
parents and children (cf. Willoughby (1934-36) ; Hoffditz (1934)). 
Meaningful sex differences occur in tests of dominance and, as 
reported above (§ 146), in the Boyd Questionnaire. 

177. Conclusions and recommendations.—Such results tend to 
agree with expectations, but are not sufficiently striking to prove 
that the tests possess great practical value. We are probably 
justified in concluding that they do measure psychologically signifi- 
cant variables when the testees are adequately motivated to give 
candid responses. It is likely also that test items which deal with 
concrete manifestations of personality traits rather than with 
intimate feelings, and which stress socially acceptable as much as, 
or more than, unacceptable characteristics, are superior. Almost 
certainly it is better to approach one trait at a time than to attempt 
to cover several in a single test ; and factorial analysis techniques 
might well be employed in deciding on those traits that are sufficiently 
clear-cut to be tested. Finally, it is very necessary to interpret 
the results with caution, remembering that they are always [...]
subjective factors that cannot be fully controlled.

D.—INDIRECT MEASURES OBTAINED FROM SELF-RATINGS 
(i) Goodness of Self-Rating

178. When a testee rates himself, and is rated by acquaintances, 
on a number of different traits, then the agreement between his own 
opinions of himself and others' opinions is sometimes assumed to 
constitute a measure of his self-insight. Should his ratings not 
merely deviate from those of others, but always be too high on 
desirable, too low on undesirable traits, we obtain an index of his 
conceitedness. Allport (1921, 1937) discovered that this
self-overevaluation correlates negatively with intelligence and with sense 
of humour; others also have shown the more intelligent to be more 
[...] though also more intelligent, tend to be more introverted (cf. § 121). 
In other words, the person who is interested in others knows 
himself well, and the person who is interested in himself knows [...]

(ii) The Checking of Extreme Responses, and Variability, in 
Self-Ratings 

179. Just as in attitude tests (§§ 74-75), so in personality 
questionnaires with multiple choice responses, some testees tend to 
check many more extreme responses than others, both in the 
neurotic and in the non-neurotic direction. In the Boyd Questionnaire,
the writer found that the proportion of definite Yes's and 
No's (as contrasted with Yes?, 0, and No?) ranged from 17 per cent. 
to 98 per cent. Again, however, this tendency to extremes appears 
to possess little psychological significance. 

[...] individual differences in the amount of 
alteration of responses, when the Bernreuter test is given twice, 
but, as with attitude tests (cf. § 76), obtained few results of interest. 
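
The proportion of definite responses mentioned above is a simple count. A minimal sketch, assuming the five response categories are coded as the strings shown (the coding and the function name are hypothetical):

```python
def definite_proportion(responses):
    """Share of responses that are a plain Yes or No rather than Yes?, 0 or No?."""
    definite = sum(1 for r in responses if r in ("Yes", "No"))
    return definite / len(responses)


# Hypothetical record of one testee on six items.
record = ["Yes", "No?", "0", "No", "Yes?", "Yes"]
print(definite_proportion(record))
```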


VI.—WORD ASSOCIATION METHODS AND INTEREST 
BLANKS
A. DESCRIPTION OF TESTS 

180. The chief difference between the following tests and those 
[...] they do not set out primarily to measure 
any particular trait, interest or attitude. They might be termed 
“buckshot” approaches to personality, in that they fire large 
numbers of stimuli at the testee, indiscriminately, in the hope that 
several will reach the mark and will touch off some significant 
emotional response. Then from these responses a great many traits 
may be deduced. We will first briefly outline the tests and then 
show how the responses are treated so as to yield measures of traits 
and interests. 


(ii) The Word Association Method


181. First devized by Galton in 1879, the word association 
method has evolved in many directions, and has been put to a 
variety of uses. In its commonest form a list of stimulus words is 
read out, one by one, by the experimenter. To each stimulus the 
testee is instructed to reply with the first word that comes to mind ; 
not to search about for apt associations but to respond immediately 
with the first thing he thinks of. Many of the stimuli tend to evoke 
superficial verbal habits, e.g. “father—mother,” “black—white,”
etc., but a few may touch on the testee’s emotional complexes. In 
those cases he may show hesitation or embarrassment, his response 
may be delayed, and the response may be some very unusual word. 
The test therefore provides the experimenter with a lead as to the 
emotional dispositions around which the testee’s life is centred. 

182. Further developments of the method.—Many refinements of 
the technique may be employed. The exact time between stimulus 
and response may be recorded with chronoscope or stop-watch ; 
reaction times which greatly exceed 2 seconds suggest some mental 
inhibition. Simultaneous records may be taken of the psychogalvanic 
reflex; large alterations in electrical resistance are believed to 
accompany emotional disturbances. Luria and others have obtained 
fruitful results with an apparatus which records the testee’s muscular 
tonus ; this also seems to vary significantly with mental tension. 
We cannot however deal here with these measures, since our concern 
is with verbal methods of approach to personality. 
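
The timing criterion described above amounts to a simple screening rule. The sketch below flags stimulus words whose reaction time exceeds a cut-off. The cut-off of 3 seconds is an assumption standing in for "greatly exceed 2 seconds", and the function name and the trial data are hypothetical.

```python
def flag_delayed(trials, cutoff_seconds=3.0):
    """Return the stimulus words whose reaction time exceeds the cut-off.

    Each trial is a tuple of (stimulus word, response word, reaction time
    in seconds). The default cut-off is an illustrative assumption.
    """
    return [stimulus for stimulus, _response, seconds in trials if seconds > cutoff_seconds]


# Hypothetical record of four stimulus words.
session = [
    ("black", "white", 1.2),
    ("father", "strict", 4.6),
    ("table", "chair", 1.0),
    ("money", "debt", 3.4),
]
print(flag_delayed(session))
```

A flagged word is only a lead for further inquiry, in keeping with the clinical use of the method; the delay by itself says nothing about what the underlying disturbance may be.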

183. Stimulus word lists.—Jung’s (1916, 1919) list of 100 words 
is frequently employed, since it contains stimuli likely to evoke a 
large number of complexes. Kent and Rosanoff's (1910-11) list 
was selected for a different purpose, and includes words which do not 
often “call up personal experiences.” In some investigations 
this has been given as a group test, the testees writing their own 
responses. Here, of course, no timing or other observation of
individuals is possible ; scoring must be based solely on the content of 
the responses. Useful word lists for individual application to 
children at a Child [...] have been published by Cattell
(1936), and by Burt [...]. Another, devized by [...]
([...]lished), is used at [...] Scottish Clinics.

184. Other types of association tests.—In “chain” or “continuous”
association tests, a stimulus word is given and the testee 
is instructed to say everything that comes to mind in connection 
with it, or to take each response as stimulus for a fresh association. 
Meltzer (1935), for instance, devized a fruitful method of studying 
children’s attitudes to their parents, by getting them to “think 
aloud” about certain stimuli. He first gave some innocuous 
practice words like “table, ball, Roosevelt,” and then “father”
and “mother.” The first ten associations to the latter words were 
recorded. 

185. Murray’s Thematic Apperception Test.—Another test which 
seems to possess considerable possibilities in child and student 
guidance has been called by Morgan and Murray (1935) the Thematic 
Apperception Test. The testee is shown a series of some 10 to 20 
pictures in an individual interview, and told to make up a short 
story to fit each picture, giving free rein to his fantasy. The stories 
are recorded verbatim. The pictures show a variety of incidents, 
but in each of them is portrayed some person of the same sex, and
about the same age, as the testee. It is found that he tends to 
project his own needs, sentiments and complexes (conscious or
unconscious) into the stories, so that the general themes of the 
stories correspond in a remarkable manner to his inner emotional
life. It is probably easier to obtain good co-operation and revelatory 
responses by this method than by the more formal word association 
methods. It falls, however, somewhat outside the scope of this 
Report, since the stimuli employed are pictorial, not verbal. There 
are several other analogous tests, e.g. the Rorschach inkblots, where 
verbal free associations are obtained to a series of meaningless black 
and white, or coloured, inkblots. With these we will not attempt 
to deal here. 


(iii) The Pressey X-O Tests 


186. The Pressey X-O Tests were originally devized, like word 
association tests, not as direct measures of any particular traits, 
but as exploratory instruments (to be applied in group form) by 
means of which the experimenter could find out what stimuli would 
arouse various kinds of emotional response in the testees (cf. Pressey 
and Chambers (1920), Pressey (1921)). Form A, for adults, contains 
four subtests, each of which consists of a printed list of 125 words, 
25 lines of 5 words each. In the first subtest the testees are instructed
to cross out (hence the name X-O) all the words that are unpleasant
to them ; and then to encircle the one word in each line which is
most unpleasant. Of the five words in a line, one is presumed to be 
unemotional or a “joker”, one refers to disgust tendencies, one to
fears, one to sex, and one to suspicions ; e.g.—

White    drunk    choke    flirt    unfair 

In [...] in capi[...] [...] supposed to refer to special types of
abnormality, paranoid, neurotic, schizoid, melancholic and
hypochondriacal, e.g. : 

injustice    noise    self-consciousness    discouragement    germs

In Form B, for children, three similar subtests call for reactions 
of “wrong, worry and like or interest.” Collins (1927) has adapted 
this Form for use in Britain and has published it in full. We will 
describe later the various methods of scoring. 



(iv) Interest Blanks


187. Vocational psychologists were early faced with the problem 
that an individual’s estimates of his own occupational interests 
are often fluctuating and unreliable. He may have very incomplete 
notions of what the different occupations entail, so that his stated 
preferences may possess scarcely any value for vocational guidance. 
“Buckshot” methods were therefore devized, first at the Carnegie 
Institute of Technology, and later at Stanford University, which 
record the testee's immediate likes and dislikes for a large number of 
miscellaneous stimuli, and then deduce his true interests from the 
total pattern of his responses. We will omit the early interest 
blanks of Moore, Ream and Freyd, and turn at once to Strong's 
(1927) Vocational Interest Blank, which has superseded them. Their 
evolution is described by Symonds (1931) and Fryer (1931). 

188. Strong’s Vocational Interest Blank.—This Blank lists :— 

I. 100 occupations, e.g. : Actor, Advertiser . . . . Wholesaler, Y.M.C.A.
Worker. 

II. 54 amusements, e.g. : Golf, Tennis, Chess, Pet Canaries . . . .

III. 39 subjects of study, e.g. : Algebra, Agriculture . . . . Typewriting, 
Zoology. 

IV. 52 miscellaneous activities, e.g. : Repairing a clock, Arguments, 
Saving money . . . .

V. 53 types of people, e.g. : Optimists, Pessimists, Foreigners, Cripples, 
Teetotalers . . . .

After each of these is printed L I D, and the testee is told to 
encircle one of these letters to represent like, indifference or dislike, 
respectively. 

VI. Four lists of ten activities or types of people : the testee checks the 
three in each list he would most like to do or be, and the three he would most 
dislike. E.g. one list includes : 
Caruso, Edison, J. P. Morgan, Pershing, Henry Ford, etc. 

VII. Comparison of preferences for 42 miscellaneous pairs of activities, 
e.g. : Playing baseball v. Watching baseball. 
Dealing with things v. Dealing with people.

VIII. A self-rating test of 40 items, e.g. : 
Get rattled easily. 
Am always on time with my work. 

Each of these is checked [...]. [...] takes approxima[tely] [...]
these 404 responses. They may then be treated in a variety of ways
so as to throw light on many different interests. Manson (1931) has
produced a similar Blank for women's vocational interests, Garretson
(1930) another for the educational interests of secondary school boys. 
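
The figure of 404 responses can be checked against the counts given for the eight parts of the Blank. The short calculation below assumes that Part VI contributes six checks per list (three most liked and three most disliked), which is an inference from the description rather than a statement in the text; the variable names are hypothetical.

```python
# Items answered singly, as listed for Parts I to V, VII and VIII.
single_response_parts = {
    "I": 100,
    "II": 54,
    "III": 39,
    "IV": 52,
    "V": 53,
    "VII": 42,
    "VIII": 40,
}

# Assumption: each of the four lists in Part VI yields 3 + 3 checks.
part_vi_checks = 4 * (3 + 3)

total_responses = sum(single_response_parts.values()) + part_vi_checks
print(total_responses)
```

On this reading the parts sum to the 404 responses mentioned, which supports the assumption about how Part VI is counted.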

189. Studies of children’s interests.—We should mention in 
passing the innumerable investigations, from Stanley Hall onwards, 
of children's interests, by means of long lists of games, favourite 
books, etc., which are similarly checked for Like or Dislike. Lehman
and Witty (1927) have conducted many studies among thousands of 
American children with their Play Quiz. Furfey’s (1928)
Developmental Age Test includes lists of interests common among boys from 
8 to 18 years. Terman (1925) made considerable use of the method 
in his studies of the differences between highly intelligent or gifted 
and normal children. He also investigated sex differences, and 
constructed a provisional classification of play interests into 
masculine and feminine. 

190. Masculine-feminine interests.—Since then Terman and 
Miles (1936) have constructed a “buckshot” test which is as
voluminous as, and even more varied than, Strong’s Blank, for 
studying sex differences in adults. This they call the M-F Attitude 
Interest Analysis Test. Its seven subtests include a controlled 
association test (stimulus words with four responses to choose from, 
analogous to Pressey X-O); a similar test with inkblot instead of 
verbal stimuli; a multiple choice test of general knowledge ; lists 
of things and activities which may evoke anger, fear, disgust, pity, 
blame, etc.; likes and dislikes ; and a personality self-rating 
questionnaire. 

B. QUALITATIVE USES OF THE TESTS 


191. The primary use of free word association or continuous 
association is clinical. The following-up of unusual responses, 
or of responses which are accompanied by other “complex
indicators” (delayed reaction, emotion, etc.) is of more value to the 
psychiatrist or psychoanalyst than are any of the derived
quantitative measures which we describe below. The same might be said 
of the Pressey X-O Test. Collins (1927), Tjaden (1926) and others 
find the various scores which it yields less useful than the exploration 
of particular responses in a clinical interview. Shellow (1931) 
recommends a similar qualitative application of the Interest Blanks 
in a vocational guidance interview. Such usages are hardly 
scientific, and do not fall within our present purview. 


C.—AGGREGATE QUANTITATIVE SCORES 
192. Word association scores.—In word association, the average 
reaction time for all the words, or its dispersion, also the average 
psychogalvanic or kinaesthetic response, and the total numbers of
complex indicators have often been investigated as measures of
emotionality, or emotional conflict, and the like. Their significance 
is somewhat doubtful, since they fail to correlate well with any
other measures of [...] psychological traits. These [...]
outside our scope. 


193. X-O affectivity score.—The total number of words crossed
out in the X-O Tests has been termed by Pressey [...]
“Affectivity” or “Richness in emotional association.” Its split-half
reliability and repeat reliability over a few days are high, but 
the latter falls off rapidly over longer intervals (cf. McGeoch and 
Whitely (1927) ; Thompson and Remmers (1928)). Correlations with 
other presumed measures of emotionality such as the Woodworth 
Inventory, etc., are generally negligible. Neither does it consistently 
differentiate delinquent or psychopathological groups from normal 
persons. On the separate subtests, however, delinquents do on the 
average appear to have more worries and to regard fewer things as 
wrong (cf. Courthial (1931) ; Bridges and Bridges (1926)). The author 
of the test admits that the Affectivity score is a blur of many kinds of 
response, and does not expect it to have much meaning. It would 
seem to the present writer that an additional reason for the failure, 
[...] but of the others described below, may, [...]
of the testee's attitudes to the tests. [...] blaming, etc. in so many 
[...]ed significance can be attached to [...] watch which the clinician
keeps on [...] free association, and the haphazardne[ss] [...]

194. Total interest scores.—Strong (1931) sometimes interprets 
the total number of Likes in his Blank as a measure of the breadth 
or range of interests. The writer, in applying a similar Blank, 
found very large variations in this respect ; some testees gave as 
few as 35 per cent., others as many as 70 per cent. of L checks. It
would seem to him, however, that this measure does not so much 
represent a trait of “likingness” or “optimism” in the testees, 
as an extraneous and irrelevant factor, somewhat akin to the 
checking of moderate or extreme responses in an attitude test 
(cf. §§ 74-75), or to the varying standards that raters adopt (cf. § 4). 
It may be due to the diverse ways in which the testees interpret 
the meaning of Like, Dislike and Indifferent. 

195. Interpretations of Like and Dislike.—Extreme instances of 
such varying interpretations of Interest Blank responses were observed 
by the writer in applying a short questionnaire on play interests to 
some 2,000 Glasgow school children. In the younger classes a 
number of the children had to be assisted individually. Some 
clearly took Like to mean “socially acceptable,” i.e. what their 
friends usually play with ; others were more concerned as to what 
would please the teacher. One boy at a Child Guidance Clinic who 
spent all his time playing with dolls, marked every item in the blank 
Like, except the item “Playing with Dolls,” which was given a 
Dislike. Presumably that was the best way he could indicate that 
dolls meant something special to him, or that other children had 
teased him about them. Though adults are seldom likely to be
quite as perverse as this*, yet certainly the meaning of Like and 
Dislike responses requires much more analysis than it has received 
so far. 

* Surely however some testees will be tempted [to] indul[ge] [...]
humour when asked if they like “Pet canaries, Chopping [...], [...] in
sheriff's posse, Acting as yell-leader, People with [...], Men who use
perfume,” and other such items in the Strong Blank [...].

196. Significance of total Likes.—Thorndike (1936) has recently 
shown concern over this general liking tendency, since it interfered 
with his studies of types of interests (cf. § 203). From observations
of the behaviour of several of his testees he is inclined to the view
that there is a real difference in breadth of interests between those
who check many and few Likes, and that this accounts for about
one half of the total Likes scores ; the other half he ascribes to
“ constant errors in interpreting and using the L I D scale.”


D. CLASSIFICATION OF RESPONSES INTO TYPES


197. Word association classifications.—Many investigators have
attempted to classify the main types of word association responses,
and then to find what proportions of a testee’s responses fall under
each heading. One of the best known classifications is Jung’s
(1916, 1919). He distinguishes Intrinsic associations (similarity of
meaning of stimulus and response) ; Extrinsic associations (contiguity
of stimulus and response in space or time) ; Clang or sound
associations ; Miscellaneous, including purely personal, associations, and
several sub-classes. Others (e.g. Freyd (1924b)) have classified
responses into subjective or egocentric, and objective. Kent and
Rosanoff (1910-11) point out pertinently that not only do different
psychologists prepare different typologies, but that also they may
vary anything up to 35 per cent. in the responses that they assign
to any one type. Murphy (1923) has shown that, even with the
most careful classification, there is [...]
types to which men [...]
Wells's (1919) [...]
the types by comparison with ratings were similarly unsuccessful.
Classifications based more on the content [...]
grammatical nature of the associations [...]
significant. Gilliland (1926) used the aggressiveness of responses
[...] enterprize,” as one of a battery
of tests for aggressiveness. Fisher and Marrow (1934) showed that
[...]

198. Classification of continuous associations and fantasies.—
Meltzer (1935, 1936) classified children’s [...]
stimulus words “ father ” [...]
children from middle class homes, less healthy from very rich, and
worst of all from the poorest homes.

199. Similar treatment is being applied to Murray's Thematic
Apperception Test (§ 189). The various stories are classified so as to
yield estimates of the strength of a number of traits, complexes, etc.
in the testees. The results of this treatment are not yet published,
and at present the test is being applied in an almost wholly qualitative
manner ; the clinician has to interpret from the stories the testee's
main personality trends.

200. X-O classified scores.—As mentioned above, the words in
two of the Pressey X-O Form A subtests are already classified under
nine types of abnormality. Allen (1927) and Flugel and Radclyffe
(1928) have studied the typological scores (i.e. the number of words
under each heading that are crossed out), and find them to be
fairly reliable, but to possess poor diagnostic validity when compared
with case studies or with answers to questionnaires dealing with the
same tendencies.

201. Types of interests.—No systematic classification of interest
blank items has been attempted, though the need for it is shown
by the frequency with which investigators resort to ad hoc
classifications in interpreting their experimental results. Thus
when Strong (1931) studied the relative Likes and Dislikes of large
groups of testees of varying ages, he found marked age differences
in items which seemed to fall under such general headings as physical
exploits, linguistic subjects of study, administrative occupations,
etc. Similarly in contrasting happily married, unhappily married
and divorced couples, Johnson and Terman (1935) found the main
differences between the groups to lie in logically coherent sets of
items, e.g. “ uplift interests,” “ intellectual interests,” etc.

202. The present writer prepared an Interest Blank with items 
referring to Spranger’s six types of value (cf. §§ 80, 66). When
applied to testees who had also taken an early draft of the Study
of Values, a contingency coefficient of 0.47 (approximately 0.70 if
corrected for attenuation) between the two sets of scores was obtained,
showing that the two different techniques do to some extent measure
the same general interests. It is doubtful however whether one may
use the absolute number of Like and Dislike responses which belong
to any one type as an index of the testee's interest in that type,
since the general “ likingness ” factor, which we described above
(§§ 194-196), may affect all such scores. In order to eliminate its
influence, the writer employed instead the relative proportions of a
testee's responses that belonged to each type.
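
The use of relative proportions in place of absolute counts may be
sketched as follows. This is only an illustration in Python, not the
writer's actual scoring procedure: the function name and the Like
counts for the two testees are hypothetical, and only Likes are
counted, for simplicity. It shows that two testees with the same
pattern of interests, one of whom checks twice as many Likes as the
other, receive identical profiles once each type is expressed as a
share of the testee's own total.

```python
def relative_type_proportions(likes_by_type):
    """Express each type's Like count as a proportion of all the testee's Likes."""
    total = sum(likes_by_type.values())
    if total == 0:
        # A testee with no Likes at all has no defined profile; return zeros.
        return {name: 0.0 for name in likes_by_type}
    return {name: count / total for name, count in likes_by_type.items()}


# Hypothetical Like counts on Spranger's six types (invented for illustration).
sparing_testee = {"theoretical": 6, "economic": 3, "aesthetic": 3,
                  "social": 4, "political": 2, "religious": 2}
# A second testee with the same pattern but twice the general "likingness".
lavish_testee = {name: 2 * count for name, count in sparing_testee.items()}

sparing_profile = relative_type_proportions(sparing_testee)
lavish_profile = relative_type_proportions(lavish_testee)
```

The absolute counts of the two testees differ on every type, whereas
the two profiles agree, which is the sense in which the general factor
is removed.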

203. Thorndike (1935, 1936) has done extensive work on the
measurement of interests by a Blank made up of short sets of items
[...] “ Practical activities, Animals, Language, Religion,”
and several other headings. He also found
that some testees obtained high absolute scores in all classes, so that
the correlations between the classes were to some extent
spurious. However, when the general likingness factor was held
constant by the partial correlation technique, there was still a fair
amount of overlapping between the different classes. For instance,
“ Art, Music, Words and Imagination ” classes inter-correlated
highly, indicating that they all belong to some wider category.
Similarly, “ Detail, System, Neatness,” and “ Animals, Beauty,
Children ” were linked.
It is clear that much more investigation is needed as to the types
of interests which are both self-consistent and distinctive from one
another. Strong’s (1931), Johnson and Terman’s (1935),
Thorndike's (1935, 1936), Cattell's (1936) and the writer's classifications
are all different, and are all lacking in an empirical basis. We shall
see below that multiple factor analysis might be a very useful
instrument for studying this problem.
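
The holding constant of a general factor by partial correlation, as
mentioned in this paragraph, may be illustrated by the usual
first-order formula. The sketch below is in Python; the function name
and the three coefficients are hypothetical, and nothing here is
claimed to reproduce Thorndike's own computations.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with a third variable z held constant.

    Standard first-order formula:
        r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))
    """
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denominator == 0:
        raise ValueError("partial correlation is undefined when r_xz or r_yz is +/-1")
    return (r_xy - r_xz * r_yz) / denominator


# Hypothetical figures: two interest classes correlate 0.60 with each other,
# and each correlates 0.50 with the total number of Likes (z).
residual_overlap = partial_correlation(0.60, 0.50, 0.50)
```

With these invented figures the correlation falls from 0.60 to about
0.47, so that part, but not all, of the apparent overlap would be
attributed to the general factor.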


E. EMPIRICAL TREATMENT OF WORD ASSOCIATION TESTS 

204. Empirical techniques.— These interpretative qualitative and 
typological approaches are generally distrusted by American psy- 
chologists ; hence by far the largest proportion of work on the tests 
described in this Chapter has been conducted with purely empirical 
techniques. Much the same method is always used. A long list of 
verbal stimuli (free association words, X-O words, L I D items, etc.) 
is applied to a large group of persons, whom we shall call the St-group 
(Standardization group). In its simplest form, all members of this 
group possess some common psychological characteristic ; e.g. they
are mentally normal as distinct from psychotic or neurotic. Their
responses to each stimulus are tabulated. The test is now given to
the testees (who are quite distinct from the St-group) ; their responses
are compared with the responses of the St-group and are scored
according to their degree of resemblance. If they differ widely,
the testees are deemed mentally abnormal. Alternatively, the
St-group includes two classes of people, [...]
and members of other vocations. The main differences between the
responses of the two classes are tabulated, and transposed
statistically into some convenient form, [...]
closely resemble those of the life [...]
[...] Olson in preparing his rating scale [...]
in his Personality [...] (§ 148).
205. Kent and Rosanoff's work.—It was first applied by Kent
and Rosanoff (1910-11) to word associations. They [...]
the responses of a St-group of 1,000 miscellaneous normal persons to
their list of 100 stimulus words, and noted the [...]
response. When a new testee gives his associations to the same
words, the frequency values of all his responses are summed to give
a measure of what is called his commonality (a high score)
or idiosyncrasy (a low score). An alternative [...]
is to note the number of common responses [...]
frequent responses of the St-group), [...]
responses (i.e. responses not listed in the tables owing to the
infrequency of their appearance). That the test possesses some value
as a measure of mental normality was shown by Kent and Rosanoff’s
[...] that the 1,000 members of the St-group gave an average
[...] per cent. individual responses each, whereas 247 mental patients
gave an average of 27 per cent. each.
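
The commonality score described in this paragraph may be sketched in
Python as follows. The frequency table, the stimulus words and the
function name are all invented for illustration and are not taken from
Kent and Rosanoff's published tables ; the sketch also assumes that a
response absent from the table counts as an "individual" response with
a frequency value of zero.

```python
def commonality_score(responses, norms):
    """Sum the St-group frequency values of a testee's responses.

    responses: dict mapping stimulus word -> the testee's response
    norms:     dict mapping stimulus word -> {response: frequency in St-group}
    Returns (commonality, number_of_individual_responses).
    """
    commonality = 0
    individual = 0
    for stimulus, response in responses.items():
        frequency = norms.get(stimulus, {}).get(response.lower(), 0)
        commonality += frequency
        if frequency == 0:
            # Not listed in the tables: counted as an individual response.
            individual += 1
    return commonality, individual


# Hypothetical frequency values per 1,000 members of a St-group.
norms = {
    "table": {"chair": 300, "wood": 80},
    "dark": {"light": 400, "night": 200},
}
testee = {"table": "chair", "dark": "cellar"}
score, individual_count = commonality_score(testee, norms)
```

A high total indicates commonality and a low total idiosyncrasy ; the
count of unlisted responses corresponds to the alternative measure
mentioned above.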
206. Kent and Rosanoff's tables are now out of date, since the
common responses to their stimulus words have altered considerably
since 1910. O'Connor's (1928) tables, based on the responses of
2,000 industrial workers, are more often used (though whether such
workers can legitimately be considered a representative normal
St-group seems doubtful). The present writer finds a correlation of
+ 0.79 ± .014 between the frequency values of 350 responses
selected at random from the Kent-Rosanoff [...]
Woodrow and Lowell (1916), using a written [...]
children's commonest responses [...]
among only 39 per cent. of the words.


207. [...] commonality score has been very [...]
[...] originally claimed to show “ autistic
thinking,” but was later widely identified with introversion (Allport
(1921) ; Freyd (1924b) ; Guthrie (1927) ; O’Connor (1928) ; Oliver
(1930) ; Schwegler (1929) ; Weber and Maijgren (1929)), the
[...] would give fewer individual
responses. Others have assumed that idiosyncrasy indicates
originality (McClatchy (1928)), high intelligence (Olson (1929) ; Wells
(1919)), low intelligence (Wheat (1931)), the probable reason for the
latter being that duller testees fail to understand some of the words
and so give unusual responses. It has also been used as a measure of
radicalism as opposed to conservatism (Moore (1925)), and of
emotionality (Elonen and Woodrow (1928) ; Laslett and Bennett (1934)).
In actual fact the agreement with self-rating tests or external
ratings on introversion is negligible, and correlations with measures
of the other suggested traits are so inconsistent as to be valueless.
So that our only clue to its meaning is Kent and Rosanoff's result
with mental patients, quoted above.

208. Wyman's word association measures of interests.—The next
development was carried out by Wyman (1925) and Kelley.
St-groups of children were selected on the basis of teachers' ratings as
being keenly or weakly interested in “ intellectual, social or activity
interests.” Their word association responses were tabulated, and
differential marks were calculated for each response. Thus, when the
test is applied to a new set of children, their responses can be scored
for resemblance to those of the intellectual, social or activity interest
groups, and measures of these interests obtained. The technique is
exceedingly laborious, yet it is objective, and has the considerable
advantage that the testees are not liable to fake their responses,
since they can have no conception as to what the test is measuring.
The various halo effects found in ratings and self-rating tests are
eliminated. Probably, however, halo played a considerable part in
the teachers’ ratings by means of which the St-groups were chosen,
since unduly high correlations of + 0.68 to + 0.80 are found between
the three interest scores. Thus Wyman’s test measures children's
resemblance to groups reputed to possess such interests, not
resemblance to unambiguous and objectively defined groups such as
Kent and Rosanoff’s or Strong’s. A worse defect is that the
St-groups were too small, numbering about 130 for each interest, so that
the scores are inconsistent. When the scores of fresh groups of
children were compared with similar ratings, the correlations were
only + 0.54, + 0.35 and + 0.20.
209. Kelley's word association measures of character.—Still more
elaborate and still less successful was Kelley’s attempt to measure
eight character traits objectively by the same technique (cf. Kelley
and Krey (1934)). Teachers’ and pupils’ ratings were used for
selecting St-groups who should be characterized by “ Courtesy,
Fair play, Honesty, Loyalty to fellows, Mastery, Poise, Regard for
Property rights, and School drive.” When the word association
test was applied to a fresh group and their eight scores compared with
similar ratings, the correlations averaged only + 0.18. These
scores also overlapped very closely with one another (cf. § 117).
210. It is probable that by taking large enough St-groups and
numerous enough word associations somewhat better results
might be obtained. Yet it is difficult to see why anyone should
assume that courtesy and the like will be expressed in word
associations. These investigators would not attempt to measure
arithmetical ability by a test which did not involve arithmetical [...],
and yet they expect meaningless methods to yield meaningful
results in the field of personality. An examination of some of
Wyman’s scoring tables reveals scarcely any logical relation between
the associations and the types of interest they are intended to
measure. For instance, with the stimulus word “Сет”, the response
“ Diamond ” scores :—

20 for intellectual, 11 for social and 15 for activity interests.

And the response “ Cake ” scores :—

3 for intellectual, 9 for social and 12 for activity interests.


F. EMPIRICAL TREATMENT OF X-O AND INTEREST TESTS

211. X-O idiosyncrasy scores.—In addition to crossing out
words the testee is instructed to encircle the one word in each line
about which he feels most strongly. Pressey therefore claims to
measure affective idiosyncrasy by finding the number of times a
testee's choices differ from the modal words most commonly encircled
by a St-group. His original St-groups consisted of 114 College
students for Form A and 388 students for Form B. The first group
is much too small, and the second provides an inadequate criterion
for scoring the responses of children. Thus it is not surprising that
the split half and repeat reliabilities of the idiosyncrasy scores are
low (0.34 to 0.60 according to Flemming (1928) ; McGeoch and
Whitely (1927) ; Thompson and Remmers (1928)). Collins (1927)
has provided useful lists of the commonest responses among
British children for each sex and age group from 11 + to 14 + years.
But the test does not appear to have had any further use in this
country. Idiosyncrasy scores are found to be somewhat larger in
mentally abnormal and delinquent than in normal groups ; thus
Collins obtains averages of [...]
delinquents and normals respectively (cf. also [...]
(1926) ; Guilford (1926) ; [...]
word association measures, they fail to [...]
consistent correlations with any other measures of emotional traits
(cf. Flemming (1928) ; Landis, Gullette and J[...]

212. Chambers (1925a, b) and Weber (1932) have worked out
differential score values, analogous to Wyman’s, from St-groups of
scholastically superior and inferior students, and from boys of
different ages, so that the test may be scored for scholastic interests
and for maturity of emotional responses. The former measure
gave no agreement with scholastic success when tried out on a fresh
group (cf. Thompson and Remmers (1928)). Weber’s Emotional
Age Scale, however, which incorporates much fresh material such as
lists of liked or disliked books and games along with the X-O Form
B Tests, does show better evidence of consistency and validity.
Though it has not been widely applied, it might provide a useful
supplement to tests of intellectual development. Furfey’s
Developmental Age Test is similar (cf. § 189).

213. Vocational interest scores.—Vocational interest blanks are
always standardized by reference to the responses of St-groups of
persons engaged in particular vocations. The early tests of Freyd
(1924b) and others were too short, and were applied to insufficiently
large groups ; hence the responses found to be typical of groups of
salesmen and mechanics failed to differentiate effectively between
other salesmen and mechanics. The same criticism applies to some
of Strong’s vocational groups, though most of them are very large.
They should include about five hundred members of a vocation if the
test is to attain an adequate consistency. Strong's Blank (1927)
may now be scored for the resemblance of a testee’s responses to the
responses of some thirty different vocational groups, including
Architects, Artists, Boy Scout Masters, Certified Public Accountants,
Chemists, Doctors, Farmers, Journalists, Policemen, Psychologists,
Teachers and Vacuum Cleaner Salesmen. Needless to say his
test cannot legitimately be employed in this country, since it is
far from likely that the likes and dislikes of Californian and British
vocational groups will be sufficiently similar. Manson’s (1931)
Blank may be scored for ten common women's vocations ; Garretson’s
(1930) for three main types of secondary education : Academic,
Commercial and Technical.
214. Derivation of vocational interest scores.—We will outline
briefly the method by which Strong determines the various interest
marks for each item. Several different statistical techniques have
been employed by different compilers of empirical tests (e.g. Allport
(1928) ; Kelley and Krey (1934) ; Flanagan (1935) ; Humm and
Wadsworth (1935)). But Strong finds the following simple technique
as effective as any. Suppose we wish to calculate the marks for
interest in Personnel Management which are to be assigned to the
item : ACTOR L I D. The percentages of the St-group of personnel
managers who check each response are noted, also the percentages
of members of all other vocational groups, with the following
result :—

                          L      I      D
    Personnel Managers   49     38     13
    All others           38     35     27

    Difference          +11     +3    −14

The differences are then transposed into somewhat smaller figures,
and the final marks for this item are + 2, + 1 and − 3, respectively.
Other marks are similarly determined for each of the 1,200 odd
possible responses, for each vocational group. An individual
testee’s score for a vocational interest is the sum of his + and −
marks, as derived from these groups. Scoring a single testee's
blank on 30 vocational interests takes several hours, unless a Hollerith
[...]
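
The derivation of item marks outlined in § 214 may be sketched in
Python as follows. Several points are assumptions and should not be
taken as Strong's stated procedure : the “ all others ” percentages
(38, 35, 27) are inferred from the quoted differences ; the rule for
transposing differences into marks is not given in the text, and
dividing by 5 and rounding is used here only because it happens to
reproduce the quoted marks of + 2, + 1 and − 3 ; the function names
are hypothetical.

```python
def item_differences(group_pct, others_pct):
    """Percentage of the vocational St-group minus percentage of all others."""
    return {resp: group_pct[resp] - others_pct[resp] for resp in group_pct}


def to_marks(differences, divisor=5):
    """Shrink the differences into small whole-number marks.

    Assumption: divide by 5 and round. The source says only that the
    differences are "transposed into somewhat smaller figures".
    """
    return {resp: round(diff / divisor) for resp, diff in differences.items()}


def vocational_score(checked, marks_by_item):
    """Sum the marks attached to the responses a testee actually checked."""
    return sum(marks_by_item[item][resp] for item, resp in checked.items())


# Figures for the item ACTOR quoted in the text ("all others" row inferred).
personnel_managers = {"L": 49, "I": 38, "D": 13}
all_others = {"L": 38, "I": 35, "D": 27}

differences = item_differences(personnel_managers, all_others)
actor_marks = to_marks(differences)

# A testee who checks Dislike for ACTOR receives the D mark for that item.
score = vocational_score({"Actor": "D"}, {"Actor": actor_marks})
```

A full key would hold one such set of three marks for every item and
every vocational group, which is why hand scoring is so laborious.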


[...] attenuation). Naturally, they are [...]
persons who are tested before [...]

216. In other researches Strong (1931) has obtained results
from St-groups of different ages from 15 to 55 years, so that the
Blank may be scored for “ interest maturity,” and from the two
sex groups. Probably however a more valid test of sex-resemblances
is provided by Terman and Miles's (1936) M-F Test. We will
summarize briefly some of their work, and leave on one side the many
applications of empirical techniques to other fields.
217. Terman and Miles’s investigation of sex differences.—Very
large numbers of items which seemed likely to differentiate the
sexes were tried out on men and women, and those which were found
to be significantly related to sex were incorporated in the two
parallel forms of the M-F Test ; + (male) and − (female) marks
were determined for each response to each item, but in the final
version every item is weighted equally so that the scoring process is
fairly easy. The range of total scores found among men is + 2[00]
to − 100, among women + 100 to − 200, the medians being + 52
and − 70. Thus there is a good deal of overlapping, indicating
that some men possess interests, attitudes, etc. more feminine than
the average woman, and vice versa. Highly meaningful differences
are found between the average scores of different groups ; e.g.
college athletes are more masculine than the norm ; persons
outstanding in their career, especially in engineering or scientific
careers, are also superior ; artists and theological students are below
the norm, and passive male homosexuals tend to approach the
feminine median. Among women similar differences occur, the
most domestic types yielding the lowest scores. These and other
quantitative results, combined with qualitative interpretation of the
best differentiating items, provide a mass of interesting data on the
psychology of the sexes.

218. Discussion of interest blanks.—We may conclude, then, that
empirical methods do give decidedly better results when applied 
to Likes and Dislikes, or to material which was logically selected 
as was Terman and Miles’s, than they do with word association and 
Cross-out tests. We also find in Strong’s and Terman’s tests 
(unlike the Wyman-Kelley tests) that there is a fairly clear meaning- 
ful relation between the marks for many of the responses and the 
interests to which they have been proved, empirically, to correspond. 
The highest marks undoubtedly occur in responses which one might 
expect to be characteristic of the various vocational or sex groups. 

219. Actually there is a slight disadvantage in this, since it
permits a testee to fake his responses if he is aware of the object of
the tests. Steinmetz (1932) has demonstrated that not only can a
testee obtain an A or B rating on any vocational interest which he is
intentionally simulating, but that also his scores on half the other
interests are seriously distorted by this faking. And Kelley, Miles
and Terman (1936) discovered that a man could, if motivated to do
so, make himself out considerably more feminine than the average
woman, or vice versa. Nevertheless it must be allowed that these
tests are much less liable to faking than are most personality
[...]

220. We are however entitled to ask, what are these [...]
measuring ? The various resemblance scores do not represent [...]
room for psychologists who try to understand, and not merely to
measure, human traits. The dangers of relying solely on empirical
treatment are well illustrated by an experiment of Burnham and
Crawford (1935). They obtained a set of purely chance scores on the
Bernreuter Inventory and Strong Interest Blank, by throwing dice
[...]
Finally, we would point out that these empirical methods are
tremendously laborious, and so consuming of time and money that,
[...]

G. MULTIPLE FACTOR ANALYSIS OF VOCATIONAL INTERESTS


221. One of the lines of advance should certainly be through
factor analysis. It is obvious that there must be great overlapping
[...]
of his six factor scores his interests in an indefinite number of more
specific fields. Moreover, better tests of such factors [...]

222. Gundlach and Gerum's investigation.—From inspection of
the interest inter-correlations, Gundlach and Gerum (1931) deduced
five main types : Social, Intellectual, Technical, Creative, and
Physical Skill interests, together with several sub-types. These
were found to be only moderately self-consistent and discrete from
one another. It was shown that each specific vocation possessed
a distinctive pattern of interests on these types. A more exact
technique of analysis is provided by factorization.
223. Thurstone's investigation.—Thurstone (1931b) himself
factorized the scores of 287 testees on 18 interests, and found four
main factors which seemed to correspond in a general way to interest
in science, in language, in business and in people*. A few of the
scores however still showed prominent specific factors, i.e. their
variance was not fully accounted for by these four factors. Hence
the next step, the prediction of many different scores from
combinations of a few factor scores, has not yet been attempted. It is
interesting to note the close correspondence between these empirically
determined factors and the four types (theoretical, aesthetic,
economic and social) which Spranger (1928) derived from logical
and intuitive considerations. It reinforces our suggestion that more
rapid progress might be made through the co-operation of
“ arm-chair ” psychology with experimental testing and statistics.
224. Other factorial studies.—Strong (1934) repeated the analysis,
including six more interest scores, and obtained a distinctly different
set of five factors, which were less easy to identify. A partial
explanation of this result may be his failure to rotate his axes
appropriately, but it may also indicate (what we pointed out in
§ 88) that the factors are not universal in that they are limited by
the particular set of measures which are factorized. A further
difficulty is suggested by Carter, Pyles and Bretnall’s (1935) analysis
of 23 interest scores, obtained from 133 youths of median age 16½
years. Here the factors resemble but do not coincide with either
Thurstone’s or Strong’s, and they probably reflect the difference
between the organization of interests among adolescents and adults.


VII. DISCUSSION AND CONCLUSIONS 
A. VERBAL TESTS AND RATINGS 


225. We have described a considerable number of, often very
ingenious, techniques which have been devised for researches on
personality and social psychology ; and those specifically mentioned
or illustrated are, it should be remembered, but a small, though
we hope representative, sample of the techniques available. In
spite of their diversity, both of form and of purpose, several general
conclusions may be drawn which appear to apply to all of them.

* One incidental result worth mention was, that the Psychologist's interest
score gave a zero correlation with the “ Interest in People ” factor.
Thurstone regards this as quite probably true.

226. Contrast between verbal personality tests and tests of
abilities.—First it is clear that none of these tests can claim to measure
psychological variables such as traits, attitudes or interests with the
same degree of objectivity and accuracy that are achieved by tests
of abilities. At least one reason for this is obvious, namely that
[...]
agree to be mechanical in nature*. Whereas behaviour of an
emotional or conative character cannot be so easily specified ; there
is considerable room for disagreement as to whether or not such and
such an action (either bodily or verbal) is or is not an expression of
introversion, of aesthetic taste, of trustworthiness, etc. We have
seen how difficult it is adequately to define the content of an attitude
[...] ; still more ambiguity is likely to occur among
[...], but are expressed
only in his personal reactions and in the impressions he makes upon
others.

227. Difficulty of quantification of traits.—Not only do affective
and conative traits possess less objective “ subject-matter ” than
abilities, they are also less amenable to consideration as
uni-dimensional variables. Certainly different people differ widely in their
sociability, tolerance, etc. ; but it seems [...]
[...] person only [...]
shade, and to be so bound up with the rest of his personality that we
[...]
and interests with which we are concerned. It would hardly be
[...] human nature without such [...]
When therefore the results of investigations reported in the
previous five chapters appear disappointing, or progress seems slow
in comparison with the efforts expended, it may simply be that our
instruments are at present inadequate or inappropriate for the
problems to which they are applied. Instances are afforded by
Landis’s observation that differences do exist between mental
patients and normals in questionnaire responses, though these
differences are obscured in the total questionnaire scores (§ 175) ; and
by the writer’s demonstration that more can be deduced from the
features and external appearance than the results of ordinary rating
experiments would allow (§ 100). At the same time there is so much
evidence of improvements in these instruments (e.g. the substitution
of graphic for numerical rating scales, the studies of the form of
questions in attitude scales and personality questionnaires, etc.) that
we certainly cannot set a limit to the future possibilities of
quantitative methods in this field. Moreover, the poor tools at our present
disposal have already revealed much which was unrecognised in the
days of purely qualitative observation and interpretation of human
phenomena ; thus there is good reason to hope for greater advances
to come.

* [...] managerial, teaching, etc., do not, of course, produce
[...] ; and these are quite as hard to define and measure [...]

228. Advantages and disadvantages of verbal over behavioural
tests.—It is largely owing to the indefiniteness of the behavioural
content of traits, attitudes and interests, that verbal methods have
been so extensively developed. Words are actions in miniature.
Hence by the use of questions and answers we can obtain information
about a vast number of actions in a short space of time, the actual
observation and measurement of which would be impracticable.
Not infrequently, also, a trait is more directly expressed in subjective
feelings, or in the effects it makes on other people, than it is in
concrete actions, so that the verbal opinions of the possessor of the
trait or of his acquaintances are the only possible sources of
information about that trait. But unfortunately, words, though originally
the correlates of actions, are much more generalized and abstract
than actions ; they are interpretations rather than descriptions, and
are fraught with ambiguities. The questions which are put in
attitude scales, in ratings, personality inventories and interest
blanks, often mean different things to different individuals, and the
responses are similarly equivocal. Our pious hope that the inclusion
of a large number of questions in each test, or that the use of analytic
ratings filled up by several raters will cancel out the errors inherent
in any one question and answer or rating, is seldom justified. For
there is ample evidence in the halo phenomenon, the corresponding
phenomena in self-rating tests, and the general “ likingness ” factor
in interest blanks, that such errors may be constant rather than
variable.


229. Need for more careful psychological study of the tests.—In
our opinion a major line of advance will consist in a more careful
analysis of these complex subjective factors bearing on the
interpretation of test questions and the psychological significance of test
responses. A useful start might be made with a thorough
introspective study, in the German tradition, of the mental processes
involved in rating, in answering Yes or No, Like or Dislike, etc. The
trend of thought in American investigations has always been opposed
to such subjective approaches. Symonds (1931) is representative of
the majority when he states, in effect, that subjective reactions to
particular test items need not concern us because the significance
or validity of the test as a whole should always be determined
empirically ; that the resistances and inhibitions aroused by many
of the tests do not matter to the psychometrist, since the
psychological variables which such tests measure should be established by
objective correlational comparisons with other variables rather than
by subjective speculations or “ arm-chair ” considerations. Yet
it may be precisely this neglect of subjective analysis which is
responsible for the poor empirical validity of so much of the work,
and for the apparently slow speed of advance. Certainly
“ arm-chair ” psychology alone will not solve our problems, but it is, and
always has been, an essential precursor to the most fruitful
experimental research.

230. Difficulties due to variability and ease of simulation of 
personality traits.—A further fundamental difference between an 
ability and an emotional characteristic is the greater variability 
in the latter from time to time. The level which we maintain 
in performing a series of arithmetical problems is much more
constant than our moods. Even the habitually depressed person
has at times been elated, the happy-go-lucky man is sometimes
cautious. Thus it is possible for us to simulate almost any trait 
or attitude with a fair degree of success. In everyday life we normally 
adapt our personalities to some extent to the company we are in,
for instance, showing different characteristics at work, at home, 
and at a party. It is only natural then that the testee should adapt 
his personality into a mould which he regards as appropriate to an 
experimental test situation. Often in everyday life a clever 
observer can penetrate through our disguises and see whether or not
our moods and sentiments are genuine. But the personality test
can hardly lay claim to similar insight; it can usually only record
the testee’s words or actions at their face value. And when a test
does not even record actions, but only descriptions or interpretations
of actions, the testee may falsify his responses with the greatest
ease. All the tests and ratings that we have described are open to
this admittedly serious defect, though those described in Chapter VI
(word associations and interest blanks) are relatively free from it.

231. Complete dependence of the tests on the testee’s Einstellung.— 
It is clear then that the methods with which the Report deals are likely
to be rendered valueless should the testees or raters have any motive
for falsification; and that it is seldom possible to detect when such
falsification has occurred. To some extent the object of the tests,
i.e. the traits or attitudes at which they are aimed, can be hidden,
but not sufficiently for them to be employed for vocational selection,
or for other purposes where personal advantage enters. Such
disguise is still less possible with ratings, unless Olson's empirical
technique be used. We are therefore entirely dependent upon the
good faith and the co-operation of the testees or raters. Either they
must be convinced that candid responses will be advantageous to
them personally, or else persuaded that the investigation is of
scientific value and that candidness will not be in any way
disadvantageous.

There is no reason to suppose that deliberate falsification played
a large part in any of the investigations we have described. But
unwitting distortions, whose effects are similar, are likely to be
ubiquitous. The average rater always tends to mark his friends
too high on desirable traits because he quite genuinely regards them
as superior; the neurasthenic attributes to himself numbers of
emotional weaknesses which the “tough-minded” rejects in all
sincerity. The term personality derives etymologically from “a
mask.” It embodies the notion that we are playing a part in some
drama. And we seldom realize the extent to which the traits and
attitudes that we display to the public gaze, or express in these
tests, or admit to ourselves, are assumed. Whether even the
psychoanalyst can penetrate to the true inner core of our temperament
or character is doubtful; certainly no test can do so.

232. Value of the results.—The conclusion follows, not that 
tests and ratings are useless, but that their results must always 
be interpreted in the light of their origins; that the probable 
subjective attitude of testee or rater should be taken into account 
in deciding on their significance. Their “ fictional” nature may 
be a merit rather than a defect. For instance, the psychologist
or psychiatrist who possesses other sources of information about
an individual who has answered a personality inventory may, 
by comparing the latter with the former, be able to discover the 
individual’s inhibitions, self-deceptions and ego-ideals, which are of 
vital importance in diagnosing and treating him. Discrepancies
between tested attitudes and observations of behaviour, either of
individuals or of groups, may be equally revealing. From this point
of view, qualitative analysis of the answers to particular test items
may be more valuable than the quantitative aggregate scores.
Nevertheless, the scores alone almost always do show some objective
validity when correlated with other criteria (cf. §§ 59, 110, 178, 215,
etc.), especially when careful attention is paid to the conditions under 
which they are obtained, and when the traits or attitudes tested 
are not of a highly intimate or personal character. So that though 
it is unwise to take them at their face value as direct measurements 
of these traits, they may justifiably be regarded as partial indicators. 
For instance, they may be combined with other tests (liable to other 
types of error), or with other ratings, in a composite battery which
will measure the traits with quite a high degree of validity (cf.
Vernon (1934a)).
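
The gain from a composite battery can be sketched numerically. The simulation below is illustrative only: it assumes that each indicator equals the trait plus an error that is independent of the errors in the other indicators, and the number of indicators, the error spread, and the sample size are hypothetical. Where the indicators share a constant error (such as halo), the gain would be correspondingly smaller.

```python
import math
import random

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)   # fixed seed so the sketch is repeatable
N_PEOPLE = 2000          # hypothetical sample size
N_INDICATORS = 4         # hypothetical number of tests or ratings combined
ERROR_SD = 1.5           # hypothetical spread of each indicator's own error

# The trait itself (unobservable in practice; available here by construction).
trait = [rng.gauss(0.0, 1.0) for _ in range(N_PEOPLE)]

# Each indicator = trait + an error independent of the other indicators.
indicators = [
    [t + rng.gauss(0.0, ERROR_SD) for t in trait]
    for _ in range(N_INDICATORS)
]

# Composite battery: the unweighted mean of the indicators for each person.
composite = [
    sum(ind[i] for ind in indicators) / N_INDICATORS
    for i in range(N_PEOPLE)
]

single_validity = pearson_r(indicators[0], trait)
composite_validity = pearson_r(composite, trait)
print(f"validity of one indicator: {single_validity:.2f}")
print(f"validity of the composite: {composite_validity:.2f}")
```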

233. Future research.—Future progress will depend then mainly
on the better adaptation of the instruments to the fields in which
they are already being applied. As suggested above, introspective
and qualitative analyses of the test situations are needed; but there
is already room for a large amount of controlled experimental
research on the form of the tests, wording of items, influence of
instructions, etc. The mere piling up of more tests, and more
tabulations of group differences in test scores or test items, of which
so big a proportion of American investigation consists, is of very
little value. More intensive studies of particular social-psychological
problems by these verbal tests and by other techniques (observational
and clinical) would be especially fruitful in throwing light on the
advantages and disadvantages of the verbal tests.

234. We are inclined also to recommend very thorough examination
of a verbal method which received only incidental mention in
the Report, but which might prove to be relatively free from most
of the difficulties we have just described: that is, the giving of
ratings, either on general traits or attitudes or on more specific
behaviour characteristics, by a psychologist or psychiatrist on the
basis of an interview with a ratee or with acquaintances of the ratee.
The ratings assigned to vocational guidance candidates at the
National Institute of Industrial Psychology (§ 108) approximate
to this type; the Vineland Social Maturity Scale (§ 88) is an even
better instance, in view of the objective nature of its component
items. In the interview the ratee, or his acquaintances, would be
asked to supply concrete information as to his thoughts, feelings,
and actions which are pertinent to the trait or item being rated;
and the psychologist would judge whether this information indicated
a high, medium, or low rating. Misunderstandings of the test
material would be enormously reduced by this method, and witting
or unwitting falsifications or disguises might be penetrated by an
experienced examiner. The examiner could not, of course, hope to be
entirely immune himself from halo and other prejudices (e.g. a
Freudian and an Adlerian might give different ratings to the same
person). But the extent of such errors could readily be investigated
experimentally; and Doll’s (1936 a, b) results suggest that it may be
very small. When different examiners, often interviewing different
informants, filled in the Social Maturity Scale, the correlations between
the scores which they assigned to the same patients amounted to
about +0.90.

B. MULTIPLE FACTOR ANALYSIS

235. Uses of factor analysis.—Although we have advocated
above a more careful study of the methods by which verbal data are 
obtained, this does not preclude further analysis of the data itself, 
by factorization or other statistical techniques. We should indeed
remember that no amount of statistics can improve on data which are
inaccurate in the first place, and that the factors extracted from
ratings or test scores are as impregnated with errors as are these
ratings or scores. Nevertheless, the factorization of personality
inventories was of considerable assistance in determining the nature
of the errors (§ 171), and statistical treatment of ratings showed
promise of enabling us to separate off the halo effect in ratings
(§ 118). Elsewhere it was found that factorial analysis was useful
in [...] overlapping variables such as the nineteen tendencies
of the Boyd Questionnaire (§ 146), and the twenty or thirty vocational
interest measures derived from the Strong Blank (§§ 998-994);
and in analyzing poorly defined psychological conceptions such as
[...] into distinctive components (§ 147). In
other words, factorial techniques constitute a powerful instrument
for the generalization and systematization of test results. It is
impossible to study personality or social phenomena without
classifying and analyzing, and we are far too apt to do so without any
scientific backing. For example, we distinguish types of interests
or of abnormal mental states, and assume that these are discrete from
one another, or that they are accompanied by various personality
traits, mainly on the grounds of subjective generalizations. And we
often imply the existence of general traits or factors, for which there
is no real evidence, when we make predictions about people, either at
clinics, in vocational guidance, or in everyday life. By means
of factor analysis we should ultimately be able to do all this
objectively.

236. Limitations of factor analysis.—In introducing the topic
of factor analysis (§ 61) we implied, as do most statistical
psychologists, that such analysis would reveal the underlying structure
of the personalities to whom our tests are applied. But we can see now
that this claim is somewhat presumptuous, and that it would be
safer to regard analysis merely as revealing the logical structure of
the applied tests. Kelley (1935) and others do indeed talk of isolating
the unitary traits or basic elements of personality by means of these 
techniques and hope eventually to establish a relatively small 
number of independent dimensions or elements, in terms of 
which any test may be classified, and any personality completely 
described. Now this conception of personality as a compound of a 
few elements is very attractive to the psychometrist, who wishes to 
measure people as economically as possible ; but it is a conception 
for which not the slightest justification may be derived from biology, 
general psychology or psychoanalysis. As Burks (1936) shows,
personality is more likely to be made up of a multitude of complexly 
inter-related dispositions than of a few discrete traits and abilities*. 
Nor do the results so far obtained by factorial investigations lend 
much support to the geometrical view. 

* The factorist's offer to supply inter-correlated factors instead of the
usual independent (orthogonal) axes would not, of course, help at all to close
the gap between these two contrasted views of personality.

We have already seen that the factors extracted are governed
by the particular set of tests that the investigator chooses to apply
(§§ 88, 994). It is no longer true, as it was when Thurstone's
technique was first put forward, that the insertion of a few additional
tests into the set, or the omission of a few, entirely alters the
resultant factors; for by rotation we can maintain the composition
of the factors approximately constant. Nevertheless, it is
obvious that the factors can only cover those facets of personality
which are represented in the test battery; hence their universality
is limited by the comprehensiveness of the sampling of human traits.
Similarly, in the field of interests, the main dimensions will inevitably
vary with the particular measures included, until such time as a
method is devised for specifying and measuring the whole range of
interests. Kelley (1935) has realized this point and has therefore
attempted to prepare a complete classification of vocational
interests.

We have seen also that the available tests cannot be accepted
as direct measures of personality, owing to their distortion by
[...]
is found among different factorizations of ratings (§ 119), or different
factorizations of self-ratings (§ 149) and attitude tests (§ 65), there is
[...]
to “w”, general adjustment, sociability, radicalism, etc., are
manifested in behaviour. Clearly then none of the factorizations
we have described can claim to have disclosed the real elements of
personality.

237. Conclusion.—More fundamental is the objection that, while
the test inter-correlations are consistent with the extracted factors, 
they do not prove that these are the only possible factors. It is 
now generally admitted that an infinite number of different 
factorizations of the same set of variables is possible (cf. Thomson 
(1935)) ; and that as their relative merits cannot be decided solely 
by mathematical considerations, the most suitable alternative must 
be selected on logical grounds. Holzinger (1936) states that: “What
the factorist seeks is the simplest, most parsimonious, and most
useful pattern for the interpretation of the underlying variables.
Factors are ‘a possible way of thinking about mental traits.’” We
would conclude then that they should not be regarded as faculties 
or entities existent in the personalities tested, but as convenient 
descriptive categories, which enable us to generalize and simplify 
our test results, and to make predictions about people with a
maximum degree of efficiency.
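
The point that the same inter-correlations are consistent with many different sets of factors can be checked on a small invented example. In the sketch below the four tests, their loadings on two orthogonal factors, and the angle of rotation are all hypothetical; the only claim illustrated is that rotating the factor axes changes every loading while leaving the reproduced correlations untouched.

```python
import math

def reproduced_correlations(loadings):
    """Correlations implied by orthogonal factors: sums of products of loadings.

    Diagonal entries are the communalities rather than unity.
    """
    n = len(loadings)
    return [
        [sum(a * b for a, b in zip(loadings[i], loadings[j])) for j in range(n)]
        for i in range(n)
    ]

def rotate(loadings, angle):
    """Rotate a two-factor loading matrix through `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[f1 * c + f2 * s, -f1 * s + f2 * c] for f1, f2 in loadings]

# Hypothetical loadings of four tests on two orthogonal factors.
loadings = [
    [0.8, 0.1],
    [0.7, 0.3],
    [0.2, 0.6],
    [0.1, 0.7],
]

rotated_loadings = rotate(loadings, math.radians(30))  # arbitrary angle

original = reproduced_correlations(loadings)
rotated = reproduced_correlations(rotated_loadings)

# Largest difference between the two sets of reproduced correlations.
max_difference = max(
    abs(original[i][j] - rotated[i][j])
    for i in range(len(loadings))
    for j in range(len(loadings))
)
print(f"largest change in reproduced correlations: {max_difference:.2e}")
```

Since any angle gives an equally good fit, the choice among such solutions cannot rest on the correlations themselves and has to be made on other grounds, such as simplicity or usefulness.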

Although then, there is no obligation to accept any investigator's
factors as representing the true dimensions of human nature (even g
is, from this point of view, a convenient fiction), yet it would make
for more rapid progress if different investigators would take more
account of the work of their predecessors, by choosing for their
main axes factors that are already widely established. If, in future
studies, more care is taken in obtaining comprehensive sets of tests,
chosen on the basis of some logical analysis of the field, and in
connecting up as many clusters of these tests as possible with
previously determined clusters, we may hope before long to achieve a
fairly complete classification of all our psychological measuring
instruments which would be of the utmost value in many branches
of pure and of applied psychology.

